Preface


This is a book about algorithms for performing arithmetic, and their imple- 
mentation on modern computers. We are concerned with software more than 
hardware — we do not cover computer architecture or the design of computer 
hardware since good books are already available on these topics. Instead, we 
focus on algorithms for efficiently performing arithmetic operations such as 
addition, multiplication, and division, and their connections to topics such 
as modular arithmetic, greatest common divisors, the fast Fourier transform 
(FFT), and the computation of special functions. 


The algorithms that we present are mainly intended for arbitrary-precision 
arithmetic. That is, they are not limited by the computer wordsize of 32 or 64 
bits, only by the memory and time available for the computation. We consider 
both integer and real (floating-point) computations.


The book is divided into four main chapters, plus one short chapter (essen- 
tially an appendix). Chapter 1 covers integer arithmetic. This has, of course, 
been considered in many other books and papers. However, there has been 
much recent progress, inspired in part by the application to public key cryp- 
tography, so most of the published books are now partly out of date or incom- 
plete. Our aim is to present the latest developments in a concise manner. At the 
same time, we provide a self-contained introduction for the reader who is not 
an expert in the field.


Chapter 2 is concerned with modular arithmetic and the FFT, and their appli- 
cations to computer arithmetic. We consider different number representations, 
fast algorithms for multiplication, division and exponentiation, and the use of 
the Chinese remainder theorem (CRT).

Chapter 3 covers floating-point arithmetic. Our concern is with high- 
precision floating-point arithmetic, implemented in software if the precision 
provided by the hardware (typically IEEE standard 53-bit significand) is 
inadequate. The algorithms described in this chapter focus on correct
rounding, extending the IEEE standard to arbitrary precision.

Chapter 4 deals with the computation, to arbitrary precision, of functions 
such as sqrt, exp, ln, sin, cos, and more generally functions defined by power
series or continued fractions. Of course, the computation of special functions is 
a huge topic so we have had to be selective. In particular, we have concentrated 
on methods that are efficient and suitable for arbitrary-precision computations. 

The last chapter contains pointers to implementations, useful web sites, 
mailing lists, and so on. Finally, at the end there is a one-page Summary of 
complexities which should be a useful aide-mémoire. 

The chapters are fairly self-contained, so it is possible to read them out of 
order. For example, Chapter 4 could be read before Chapters 1-3, and Chap- 
ter 5 can be consulted at any time. Some topics, such as Newton’s method, 
appear in different guises in several chapters. Cross-references are given where 
appropriate. 

For details that are omitted, we give pointers in the Notes and references 
sections of each chapter, as well as in the bibliography. We have tried, as far 
as possible, to keep the main text uncluttered by footnotes and references, so 
most references are given in the Notes and references sections. 

The book is intended for anyone interested in the design and implementation 
of efficient algorithms for computer arithmetic, and more generally efficient 
numerical algorithms. We did our best to present algorithms that are ready to 
implement in your favorite language, while keeping a high-level description 
and not getting too involved in low-level or machine-dependent details. An 
alphabetical list of algorithms can be found in the index. 

Although the book is not specifically intended as a textbook, it could be 
used in a graduate course in mathematics or computer science, and for this 
reason, as well as to cover topics that could not be discussed at length in the 
text, we have included exercises at the end of each chapter. The exercises vary 
considerably in difficulty, from easy to small research projects, but we have 
not attempted to assign them a numerical rating. For solutions to the exercises, 
please contact the authors. 

We welcome comments and corrections. Please send them to either of the 
authors. 


Richard Brent and Paul Zimmermann 
Canberra and Nancy 
MCA@rpbrent.com 
Paul.Zimmermann@inria.fr 
Notation


$\mathbb{C}$                set of complex numbers
$\widehat{\mathbb{C}}$      set of extended complex numbers $\mathbb{C} \cup \{\infty\}$
$\mathbb{N}$                set of natural numbers (nonnegative integers)
$\mathbb{N}^*$              set of positive integers $\mathbb{N} \setminus \{0\}$
$\mathbb{Q}$                set of rational numbers
$\mathbb{R}$                set of real numbers
$\mathbb{Z}$                set of integers
$\mathbb{Z}/n\mathbb{Z}$    ring of residues modulo $n$
$C^n$                       set of (real or complex) functions with $n$ continuous
                            derivatives in the region of interest

$\Re(z)$                    real part of a complex number $z$
$\Im(z)$                    imaginary part of a complex number $z$
$\bar{z}$                   conjugate of a complex number $z$
$|z|$                       Euclidean norm of a complex number $z$,
                            or absolute value of a scalar $z$

$B_n$                       Bernoulli numbers, $\sum_{n \ge 0} B_n z^n/n! = z/(e^z - 1)$
$C_n$                       scaled Bernoulli numbers, $C_n = B_{2n}/(2n)!$,
                            $\sum_{n \ge 0} C_n z^{2n} = (z/2)/\tanh(z/2)$
$T_n$                       tangent numbers, $\sum_{n \ge 1} T_n z^{2n-1}/(2n-1)! = \tan z$
$H_n$                       harmonic number $\sum_{j=1}^{n} 1/j$ (0 if $n \le 0$)

$\binom{n}{k}$              binomial coefficient "$n$ choose $k$" $= n!/(k!\,(n-k)!)$
                            (0 if $k < 0$ or $k > n$)


$\beta$                     "word" base (usually $2^{32}$ or $2^{64}$) or "radix" (floating-point)
$n$                         "precision": number of base $\beta$ digits in an integer or in a
                            floating-point significand, or a free variable
$\varepsilon$               "machine precision" $\beta^{1-n}/2$ or (in complexity bounds)
                            an arbitrarily small positive constant
$\eta$                      smallest positive subnormal number

$\circ(x)$, $\circ_n(x)$    rounding of real number $x$ in precision $n$ (Definition 3.1)
$\mathrm{ulp}(x)$           for a floating-point number $x$, one unit in the last place

$M(n)$                      time to multiply $n$-bit integers, or polynomials of
                            degree $n - 1$, depending on the context
$\sim M(n)$                 a function $f(n)$ such that $f(n)/M(n) \to 1$ as $n \to \infty$
                            (we sometimes lazily omit the "$\sim$" if the meaning is clear)
$M(m, n)$                   time to multiply an $m$-bit integer by an $n$-bit integer
$D(n)$                      time to divide a $2n$-bit integer by an $n$-bit integer,
                            giving quotient and remainder
$D(m, n)$                   time to divide an $m$-bit integer by an $n$-bit integer,
                            giving quotient and remainder

$a \mid b$                  $a$ is a divisor of $b$, that is $b = ka$ for some $k \in \mathbb{Z}$
$a = b \bmod m$             modular equality, $m \mid (a - b)$
$q \leftarrow a \operatorname{div} b$    assignment of integer quotient to $q$ ($0 \le a - qb < b$)
$r \leftarrow a \bmod b$    assignment of integer remainder to $r$ ($0 \le r = a - qb < b$)
$(a, b)$, $\gcd(a, b)$      greatest common divisor of $a$ and $b$
$\left(\frac{a}{b}\right)$ or $(a|b)$    Jacobi symbol ($b$ odd and positive)

iff                         if and only if
$i \wedge j$                bitwise and of integers $i$ and $j$,
                            or logical and of two Boolean expressions
$i \vee j$                  bitwise or of integers $i$ and $j$,
                            or logical or of two Boolean expressions
$i \oplus j$                bitwise exclusive-or of integers $i$ and $j$
$i \ll k$                   integer $i$ multiplied by $2^k$
$i \gg k$                   quotient of division of integer $i$ by $2^k$

$a \cdot b$, $a \times b$   product of scalars $a$, $b$
$a * b$                     cyclic convolution of vectors $a$, $b$

$\nu(n)$                    2-valuation: largest $k$ such that $2^k$ divides $n$ ($\nu(0) = \infty$)
$\sigma(e)$                 length of the shortest addition chain to compute $e$
$\phi(n)$                   Euler's totient function, $\#\{m : 0 < m \le n \wedge (m, n) = 1\}$
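$\deg(A)$                   for a polynomial $A$, the degree of $A$
$\mathrm{ord}(A)$           for a power series $A = \sum_j a_j z^j$,
                            $\mathrm{ord}(A) = \min\{j : a_j \ne 0\}$ ($\mathrm{ord}(0) = +\infty$)

$\exp(x)$ or $e^x$          exponential function
$\ln(x)$                    natural logarithm
$\log_b(x)$                 base-$b$ logarithm $\ln(x)/\ln(b)$
$\lg(x)$                    base-2 logarithm $\ln(x)/\ln(2) = \log_2(x)$
$\log(x)$                   logarithm to any fixed base
$\log^k(x)$                 $(\log x)^k$

$\lceil x \rceil$           ceiling function, $\min\{n \in \mathbb{Z} : n \ge x\}$
$\lfloor x \rfloor$         floor function, $\max\{n \in \mathbb{Z} : n \le x\}$
$\lfloor x \rceil$          nearest integer function, $\lfloor x + 1/2 \rfloor$

$\mathrm{sign}(n)$          $+1$ if $n > 0$, $-1$ if $n < 0$, and 0 if $n = 0$
$\mathrm{nbits}(n)$         $\lfloor \lg(n) \rfloor + 1$ if $n > 0$, 0 if $n = 0$

$[a, b]$                    closed interval $\{x \in \mathbb{R} : a \le x \le b\}$ (empty if $a > b$)
$(a, b)$                    open interval $\{x \in \mathbb{R} : a < x < b\}$ (empty if $a \ge b$)
$[a, b)$, $(a, b]$          half-open intervals, $a \le x < b$, $a < x \le b$ respectively

${}^t[a, b]$ or $[a, b]^t$  column vector $\begin{pmatrix} a \\ b \end{pmatrix}$
$[a, b; c, d]$              $2 \times 2$ matrix $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$

$\hat{a}_j$                 element of the (forward) Fourier transform of vector $a$
$\tilde{a}_j$               element of the backward Fourier transform of vector $a$

$f(n) = O(g(n))$            $\exists c, n_0$ such that $|f(n)| \le c\,g(n)$ for all $n \ge n_0$
$f(n) = \Omega(g(n))$       $\exists c > 0, n_0$ such that $|f(n)| \ge c\,g(n)$ for all $n \ge n_0$
$f(n) = \Theta(g(n))$       $f(n) = O(g(n))$ and $g(n) = O(f(n))$

123 456 789                 123456789 (for large integers, we may use a space after
                            every third digit)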
$xxx.yyy_\rho$              a number $xxx.yyy$ written in base $\rho$;
                            for example, the decimal number 3.25 is $11.01_2$ in binary
$\frac{a}{b+}\,\frac{c}{d+}\,\frac{e}{f+}\cdots$    continued fraction $a/(b + c/(d + e/(f + \cdots)))$
$|A|$                       determinant of a matrix $A$, e.g.
                            $\begin{vmatrix} a & b \\ c & d \end{vmatrix} = ad - bc$
$\mathrm{PV}\int_a^b f(x)\,dx$    Cauchy principal value integral, defined by a limit
                            if $f$ has a singularity in $(a, b)$
$s \,\|\, t$                concatenation of strings $s$ and $t$
$\triangleright$ <text>     comment in an algorithm
$\Box$                      end of a proof
Integer arithmetic


In this chapter, our main topic is integer arithmetic. However, we 
shall see that many algorithms for polynomial arithmetic are sim- 
ilar to the corresponding algorithms for integer arithmetic, but 
simpler due to the lack of carries in polynomial arithmetic. Con- 
sider for example addition: the sum of two polynomials of degree 
n always has degree at most n, whereas the sum of two n-digit in- 
tegers may have n + 1 digits. Thus, we often describe algorithms 
for polynomials as an aid to understanding the corresponding 
algorithms for integers. 


1.1 Representation and notations 


We consider in this chapter algorithms working on integers. We distinguish 
between the logical — or mathematical — representation of an integer, and its 
physical representation on a computer. Our algorithms are intended for “large” 
integers — they are not restricted to integers that can be represented in a single 
computer word. 

Several physical representations are possible. We consider here only the
most common one, namely a dense representation in a fixed base. Choose an
integral base $\beta > 1$. (In case of ambiguity, $\beta$ will be called the
internal base.) A positive integer $A$ is represented by the length $n$ and the
digits $a_i$ of its base $\beta$ expansion

$$A = a_{n-1}\beta^{n-1} + \cdots + a_1\beta + a_0,$$

where $0 \le a_i \le \beta - 1$, and $a_{n-1}$ is sometimes assumed to be non-zero.
Since the base $\beta$ is usually fixed in a given program, only the length $n$ and
the integers $(a_i)_{0 \le i < n}$ need to be stored. Some common choices for $\beta$
are $2^{32}$ on a 32-bit computer, or $2^{64}$ on a 64-bit machine; other possible
choices are respectively $10^9$ and $10^{19}$ for a decimal representation, or
$2^{53}$ when using double-precision floating-point registers. Most algorithms
given in this chapter work in any base; the exceptions are explicitly mentioned.

We assume that the sign is stored separately from the absolute value. This 
is known as the “sign-magnitude” representation. Zero is an important special 
case; to simplify the algorithms we assume that n = 0 if A = 0, and we usually 
assume that this case is treated separately. 
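
The dense representation can be illustrated by a minimal Python sketch. The function names `to_digits` and `from_digits` are ours, chosen for illustration; digits are stored least significant first, and zero is represented by an empty list, in line with the convention that the length is 0 for zero.

```python
def to_digits(a, beta):
    """Return the base-beta digits of a >= 0, least significant first."""
    digits = []
    while a > 0:
        a, r = divmod(a, beta)
        digits.append(r)
    return digits  # the empty list (length 0) represents zero


def from_digits(digits, beta):
    """Evaluate sum(digits[i] * beta**i) by Horner's rule."""
    a = 0
    for d in reversed(digits):
        a = a * beta + d
    return a
```

Only the length and the digits are stored; the base is a parameter of the program, and a sign would be kept in a separate field.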

Except when explicitly mentioned, we assume that all operations are off-line,
i.e. all inputs (resp. outputs) are completely known at the beginning (resp. end)
of the algorithm. Different models include lazy and relaxed algorithms, and
are discussed in the Notes and references (§1.9).


1.2 Addition and subtraction 


As an explanatory example, here is an algorithm for integer addition. In the 
algorithm, d is a carry bit. 

Our algorithms are given in a language that mixes mathematical notation
and syntax similar to that found in many high-level computer languages. It
should be straightforward to translate into a language such as C. Note that
“:=” indicates a definition, and “←” indicates assignment. Line numbers are
included if we need to refer to individual lines in the description or analysis of
the algorithm.


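Algorithm 1.1 IntegerAddition
Input: $A = \sum_{i=0}^{n-1} a_i\beta^i$, $B = \sum_{i=0}^{n-1} b_i\beta^i$, carry-in $0 \le d_{in} \le 1$
Output: $C := \sum_{i=0}^{n-1} c_i\beta^i$ and $0 \le d \le 1$ such that $A + B + d_{in} = d\beta^n + C$
1: $d \leftarrow d_{in}$
2: for $i$ from 0 to $n - 1$ do
3:     $s \leftarrow a_i + b_i + d$
4:     $(d, c_i) \leftarrow (s \operatorname{div} \beta, s \bmod \beta)$
5: return $C, d$.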
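
A direct transcription of Algorithm IntegerAddition into Python might look as follows. This is a sketch under our own conventions (little-endian digit lists of equal length; the name `integer_addition` is ours), and it relies on Python integers not overflowing at step 3.

```python
def integer_addition(a, b, beta, d_in=0):
    """Add two n-digit numbers given as little-endian digit lists.

    Returns (c, d) such that A + B + d_in == d * beta**n + C.
    """
    assert len(a) == len(b) and d_in in (0, 1)
    d = d_in
    c = []
    for ai, bi in zip(a, b):
        s = ai + bi + d            # step 3: at most 2*beta - 1
        d, ci = divmod(s, beta)    # step 4: new carry and output digit
        c.append(ci)
    return c, d
```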


Let $T$ be the number of different values taken by the data type representing
the coefficients $a_i, b_i$. (Clearly, $\beta \le T$, but equality does not necessarily
hold, for example $\beta = 10^9$ and $T = 2^{32}$.) At step 3, the value of $s$ can be as
large as $2\beta - 1$, which is not representable if $\beta = T$. Several workarounds
are possible: either use a machine instruction that gives the possible carry of
$a_i + b_i$, or use the fact that, if a carry occurs in $a_i + b_i$, then the computed
sum (if performed modulo $T$) equals $t := a_i + b_i - T < a_i$; thus, comparing
$t$ and $a_i$ will determine if a carry occurred. A third solution is to keep a bit in
reserve, taking $\beta \le T/2$.
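
The second workaround (detecting the carry by comparison) can be sketched in Python by simulating wrapping arithmetic. We assume here, for illustration only, an unsigned 32-bit word type with $\beta = T = 2^{32}$; the helper names are ours.

```python
T = 2**32  # assumed word type: unsigned 32-bit, with beta = T


def add_words(ai, bi):
    """Return (t, carry), where t = (ai + bi) mod T as a wrapping add would give."""
    t = (ai + bi) % T
    carry = 1 if t < ai else 0   # a carry occurred iff the sum wrapped below ai
    return t, carry


def add_with_wraparound(a, b, d_in=0):
    """Add equal-length little-endian word lists using only wrapping additions."""
    d = d_in
    c = []
    for ai, bi in zip(a, b):
        t, c1 = add_words(ai, bi)
        t, c2 = add_words(t, d)
        c.append(t)
        d = c1 | c2   # if ai + bi wrapped, then t <= T - 2, so adding d cannot wrap again
    return c, d
```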

The subtraction code is very similar. Step 3 simply becomes $s \leftarrow a_i - b_i + d$,
where $d \in \{-1, 0\}$ is the borrow of the subtraction, and $-\beta \le s < \beta$. The
other steps are unchanged, with the invariant $A - B + d_{in} = d\beta^n + C$.
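
A corresponding sketch for subtraction, again with names of our choosing. It relies on the fact that Python's `divmod` rounds the quotient towards minus infinity, so the borrow automatically lies in $\{-1, 0\}$ and the digit in $[0, \beta)$.

```python
def integer_subtraction(a, b, beta, d_in=0):
    """Subtract equal-length little-endian digit lists.

    Returns (c, d) with d in {-1, 0} such that A - B + d_in == d * beta**n + C.
    """
    assert len(a) == len(b) and d_in in (-1, 0)
    d = d_in
    c = []
    for ai, bi in zip(a, b):
        s = ai - bi + d            # -beta <= s < beta
        d, ci = divmod(s, beta)    # floor division: d is -1 or 0
        c.append(ci)
    return c, d
```

For example, subtracting 100 from 1 in base 10 with three digits gives the digits of 901 and a final borrow of $-1$, since $1 - 100 = -1000 + 901$.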

We use the arithmetic complexity model, where cost is measured by the 
number of machine instructions performed, or equivalently (up to a constant 
factor) the time on a single processor. 

Addition and subtraction of n-word integers cost O(n), which is negligible 
compared to the multiplication cost. However, it is worth trying to reduce the 
constant factor implicit in this O(n) cost. We shall see in §1.3 that “fast” mul- 
tiplication algorithms are obtained by replacing multiplications by additions 
(usually more additions than the multiplications that they replace). Thus, the 
faster the additions are, the smaller will be the thresholds for changing over to 
the “fast” algorithms. 


1.3 Multiplication 


A nice application of large integer multiplication is the Kronecker–Schönhage
trick, also called segmentation or substitution by some authors. Assume we
want to multiply two polynomials, $A(x)$ and $B(x)$, with non-negative integer
coefficients (see Exercise 1.1 for negative coefficients). Assume both
polynomials have degree less than $n$, and the coefficients are bounded by $\rho$.
Now take a power $X = \beta^k > n\rho^2$ of the base $\beta$, and multiply the integers
$a = A(X)$ and $b = B(X)$ obtained by evaluating $A$ and $B$ at $x = X$. If
$C(x) = A(x)B(x) = \sum c_i x^i$, we clearly have $C(X) = \sum c_i X^i$. Now since the
$c_i$ are bounded by $n\rho^2 < X$, the coefficients $c_i$ can be retrieved by simply
“reading” blocks of $k$ words in $C(X)$. Assume for example that we want to compute

$$(6x^5 + 6x^4 + 4x^3 + 9x^2 + x + 3)(7x^4 + x^3 + 2x^2 + x + 7),$$

with degree less than $n = 6$, and coefficients bounded by $\rho = 9$. We can take
$X = 10^3 > n\rho^2$, and perform the integer multiplication

    6 006 004 009 001 003 × 7 001 002 001 007
        = 42 048 046 085 072 086 042 070 010 021,

from which we can read off the product

$$42x^9 + 48x^8 + 46x^7 + 85x^6 + 72x^5 + 86x^4 + 42x^3 + 70x^2 + 10x + 21.$$
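
The worked example can be reproduced with a short Python sketch. We take $\beta = 10$ as in the example; the function name `kronecker_multiply` and the coefficient-list convention (lowest degree first) are ours.

```python
def kronecker_multiply(a, b, rho):
    """Multiply polynomials with integer coefficients in [0, rho].

    Polynomials are coefficient lists, lowest degree first. A single big
    integer product is used, as in the Kronecker-Schonhage trick.
    """
    n = max(len(a), len(b))
    k = 1
    while 10**k <= n * rho * rho:     # smallest power X = 10**k > n * rho**2
        k += 1
    X = 10**k
    A = sum(c * X**i for i, c in enumerate(a))   # A(X)
    B = sum(c * X**i for i, c in enumerate(b))   # B(X)
    C = A * B
    coeffs = []
    for _ in range(len(a) + len(b) - 1):         # read off blocks of k digits
        C, c = divmod(C, X)
        coeffs.append(c)
    return coeffs
```

With the two polynomials of the example and $\rho = 9$, the sketch selects $X = 10^3$ and returns the coefficients 21, 10, 70, 42, 86, 72, 85, 46, 48, 42 (lowest degree first).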
Conversely, suppose we want to multiply two integers $a = \sum_{0 \le i < n} a_i\beta^i$
and $b = \sum_{0 \le j < n} b_j\beta^j$. Multiply the polynomials $A(x) = \sum_{0 \le i < n} a_i x^i$ and
$B(x) = \sum_{0 \le j < n} b_j x^j$, obtaining a polynomial $C(x)$, then evaluate $C(x)$ at
$x = \beta$ to obtain $ab$. Note that the coefficients of $C(x)$ may be larger than $\beta$, in
fact they may be up to about $n\beta^2$. For example, with $a = 123$, $b = 456$, and
$\beta = 10$, we obtain $A(x) = x^2 + 2x + 3$, $B(x) = 4x^2 + 5x + 6$, with product
$C(x) = 4x^4 + 13x^3 + 28x^2 + 27x + 18$, and $C(10) = 56088$. These examples
demonstrate the analogy between operations on polynomials and integers, and
also show the limits of the analogy.

A common and very useful notation is to let $M(n)$ denote the time to
multiply $n$-bit integers, or polynomials of degree $n - 1$, depending on the context.
In the polynomial case, we assume that the cost of multiplying coefficients is 
constant; this is known as the arithmetic complexity model, whereas the bit 
complexity model also takes into account the cost of multiplying coefficients, 
and thus their bit-size. 


1.3.1 Naive multiplication 


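Algorithm 1.2 BasecaseMultiply
Input: $A = \sum_{i=0}^{m-1} a_i\beta^i$, $B = \sum_{j=0}^{n-1} b_j\beta^j$
Output: $C = AB := \sum_{k=0}^{m+n-1} c_k\beta^k$
1: $C \leftarrow A \cdot b_0$
2: for $j$ from 1 to $n - 1$ do
3:     $C \leftarrow C + \beta^j (A \cdot b_j)$
4: return $C$.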


Theorem 1.1 Algorithm BasecaseMultiply computes the product AB 
correctly, and uses O(mn) word operations. 


The multiplication by $\beta^j$ at step 3 is trivial with the chosen dense
representation; it simply requires shifting by $j$ words towards the most
significant words. The main operation in Algorithm BasecaseMultiply is the
computation of $A \cdot b_j$ and its accumulation into $C$ at step 3. Since all fast
algorithms rely on multiplication, the most important operation to optimize in
multiple-precision software is thus the multiplication of an array of $m$ words by
one word, with accumulation of the result in another array of $m + 1$ words.
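
As an illustration, here is a Python sketch of BasecaseMultiply built around that inner operation. The helper name `addmul_1` and the in-place accumulation into a zero-initialized array are our choices (the algorithm as stated starts from $C \leftarrow A \cdot b_0$, which is equivalent).

```python
def addmul_1(c, a, bj, offset, beta):
    """Accumulate a * bj into c starting at position offset (c is modified in place)."""
    carry = 0
    for i, ai in enumerate(a):
        carry, c[offset + i] = divmod(c[offset + i] + ai * bj + carry, beta)
    c[offset + len(a)] += carry   # this word was still zero, so no further carry


def basecase_multiply(a, b, beta):
    """Schoolbook product of little-endian digit lists; the result has m + n digits."""
    c = [0] * (len(a) + len(b))
    for j, bj in enumerate(b):
        addmul_1(c, a, bj, j, beta)   # the shift by j words is just an index offset
    return c
```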
We sometimes call Algorithm BasecaseMultiply schoolbook multiplication
since it is close to the “long multiplication” algorithm that used to be taught at 
school. 

Since multiplication with accumulation usually makes extensive use of the 
pipeline, it is best to give it arrays that are as long as possible, which means 
that $A$ rather than $B$ should be the operand of larger size (i.e. $m \ge n$).


1.3.2 Karatsuba’s algorithm 


Karatsuba’s algorithm is a “divide and conquer” algorithm for multiplication 
of integers (or polynomials). The idea is to reduce a multiplication of length n 
to three multiplications of length n/2, plus some overhead that costs O(n). 

In the following, $n_0 \ge 2$ denotes the threshold between naive multiplication
and Karatsuba's algorithm, which is used for $n_0$-word and larger inputs.
The optimal “Karatsuba threshold” $n_0$ can vary from about ten to about 100
words, depending on the processor and on the relative cost of multiplication
and addition (see Exercise 1.6).


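Algorithm 1.3 KaratsubaMultiply
Input: $A = \sum_{i=0}^{n-1} a_i\beta^i$, $B = \sum_{j=0}^{n-1} b_j\beta^j$
Output: $C = AB := \sum_{k=0}^{2n-1} c_k\beta^k$
if $n < n_0$ then return BasecaseMultiply($A, B$)
$k \leftarrow \lceil n/2 \rceil$
$(A_0, B_0) := (A, B) \bmod \beta^k$, $(A_1, B_1) := (A, B) \operatorname{div} \beta^k$
$s_A \leftarrow \mathrm{sign}(A_0 - A_1)$, $s_B \leftarrow \mathrm{sign}(B_0 - B_1)$
$C_0 \leftarrow$ KaratsubaMultiply($A_0, B_0$)
$C_1 \leftarrow$ KaratsubaMultiply($A_1, B_1$)
$C_2 \leftarrow$ KaratsubaMultiply($|A_0 - A_1|, |B_0 - B_1|$)
return $C := C_0 + (C_0 + C_1 - s_A s_B C_2)\beta^k + C_1\beta^{2k}$.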
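
The subtractive scheme can be sketched in Python on built-in integers. For simplicity we assume $\beta = 2$, so that sizes are counted in bits rather than words, and we let Python's own multiplication stand in for BasecaseMultiply below the threshold; the parameter name `n0_bits` is ours and must be at least 2.

```python
def sign(x):
    return (x > 0) - (x < 0)


def karatsuba_multiply(a, b, n0_bits=32):
    """Subtractive Karatsuba on non-negative ints, splitting at bit position k."""
    n = max(a.bit_length(), b.bit_length())
    if n < n0_bits:
        return a * b                  # stands in for BasecaseMultiply
    k = (n + 1) // 2                  # ceil(n / 2)
    mask = (1 << k) - 1
    a0, a1 = a & mask, a >> k         # A mod 2**k, A div 2**k
    b0, b1 = b & mask, b >> k
    sa, sb = sign(a0 - a1), sign(b0 - b1)
    c0 = karatsuba_multiply(a0, b0, n0_bits)
    c1 = karatsuba_multiply(a1, b1, n0_bits)
    c2 = karatsuba_multiply(abs(a0 - a1), abs(b0 - b1), n0_bits)
    return c0 + ((c0 + c1 - sa * sb * c2) << k) + (c1 << (2 * k))
```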


Theorem 1.2 Algorithm KaratsubaMultiply computes the product $AB$
correctly, using $K(n) = O(n^\alpha)$ word multiplications, with $\alpha = \lg 3 \approx 1.585$.


Proof. Since $s_A|A_0 - A_1| = A_0 - A_1$ and $s_B|B_0 - B_1| = B_0 - B_1$, we
have $s_A s_B |A_0 - A_1||B_0 - B_1| = (A_0 - A_1)(B_0 - B_1)$, and thus
$C = A_0B_0 + (A_0B_1 + A_1B_0)\beta^k + A_1B_1\beta^{2k}$.

Since $A_0$, $B_0$, $|A_0 - A_1|$ and $|B_0 - B_1|$ have (at most) $\lceil n/2 \rceil$ words, and
$A_1$ and $B_1$ have (at most) $\lfloor n/2 \rfloor$ words, the number $K(n)$ of word
multiplications satisfies the recurrence $K(n) = n^2$ for $n < n_0$, and
$K(n) = 2K(\lceil n/2 \rceil) + K(\lfloor n/2 \rfloor)$ for $n \ge n_0$. Assume
$2^{\ell-1}n_0 < n \le 2^{\ell}n_0$ with $\ell \ge 1$. Then $K(n)$ is the sum of three $K(j)$
values with $j \le 2^{\ell-1}n_0$, so at most $3^{\ell}$ values $K(j)$ with $j \le n_0$. Thus,
$K(n) \le 3^{\ell}\max(K(n_0), (n_0 - 1)^2)$, which gives $K(n) \le Cn^\alpha$
with $C = 3^{1-\lg(n_0)}\max(K(n_0), (n_0 - 1)^2)$.
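
The recurrence and the resulting bound are easy to check numerically. The following Python sketch (function names are ours) evaluates $K(n)$ exactly and computes the constant $C$ of the proof in floating point.

```python
from functools import lru_cache
from math import log2


def make_K(n0):
    """Return a function computing the number K(n) of word products, for threshold n0."""
    @lru_cache(maxsize=None)
    def K(n):
        if n < n0:
            return n * n                          # basecase: n**2 word products
        return 2 * K((n + 1) // 2) + K(n // 2)    # ceil and floor halves
    return K


def bound_constant(n0):
    """Constant C of the proof: C = 3**(1 - lg n0) * max(K(n0), (n0 - 1)**2)."""
    K = make_K(n0)
    return 3 ** (1 - log2(n0)) * max(K(n0), (n0 - 1) ** 2)
```

For instance, with $n_0 = 2$ we get $K(2) = 3$, $K(3) = 7$, $K(4) = 9$ and $C = 3$.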


Different variants of Karatsuba's algorithm exist; the variant presented here
is known as the subtractive version. Another classical one is the additive
version, which uses $A_0 + A_1$ and $B_0 + B_1$ instead of $|A_0 - A_1|$ and $|B_0 - B_1|$.
However, the subtractive version is more convenient for integer arithmetic, since
it avoids the possible carries in $A_0 + A_1$ and $B_0 + B_1$, which require either an
extra word in these sums, or extra additions.

The efficiency of an implementation of Karatsuba's algorithm depends
heavily on memory usage. It is important to avoid allocating memory for the
intermediate results $|A_0 - A_1|$, $|B_0 - B_1|$, $C_0$, $C_1$, and $C_2$ at each step
(although modern compilers are quite good at optimizing code and removing
unnecessary memory references). One possible solution is to allow a large
temporary storage of $m$ words, used both for the intermediate results and for the
recursive calls. It can be shown that an auxiliary space of $m = 2n$ words (or even
$m = O(\log n)$) is sufficient (see Exercises 1.7 and 1.8).

Since the product $C_2$ is used only once, it may be faster to have auxiliary
routines KaratsubaAddmul and KaratsubaSubmul that accumulate their re- 
sults, calling themselves recursively, together with KaratsubaMultiply (see 
Exercise 1.10). 

The version presented here uses $\sim 4n$ additions (or subtractions): $2 \times (n/2)$
to compute $|A_0 - A_1|$ and $|B_0 - B_1|$, then $n$ to add $C_0$ and $C_1$, again $n$ to
add or subtract $C_2$, and $n$ to add $(C_0 + C_1 - s_A s_B C_2)\beta^k$ to $C_0 + C_1\beta^{2k}$. An
improved scheme uses only $\sim 7n/2$ additions (see Exercise 1.9).

When considered as algorithms on polynomials, most fast multiplication
algorithms can be viewed as evaluation/interpolation algorithms. Karatsuba's
algorithm regards the inputs as polynomials $A_0 + A_1x$ and $B_0 + B_1x$ evaluated
at $x = \beta^k$; since their product $C(x)$ is of degree 2, Lagrange's interpolation
theorem says that it is sufficient to evaluate $C(x)$ at three points. The
subtractive version evaluates¹ $C(x)$ at $x = 0, -1, \infty$, whereas the additive
version uses $x = 0, +1, \infty$.


1.3.3 Toom–Cook multiplication


Karatsuba's idea readily generalizes to what is known as Toom–Cook $r$-way
multiplication. Write the inputs as $a_0 + \cdots + a_{r-1}x^{r-1}$ and
$b_0 + \cdots + b_{r-1}x^{r-1}$, with $x = \beta^k$, and $k = \lceil n/r \rceil$. Since their product
$C(x)$ is of degree $2r - 2$, it suffices to evaluate it at $2r - 1$ distinct points to be
able to recover $C(x)$, and in particular $C(\beta^k)$. If $r$ is chosen optimally,
Toom–Cook multiplication of $n$-word numbers takes time $n^{1 + O(1/\sqrt{\log n})}$.

¹ Evaluating $C(x)$ at $\infty$ means computing the product $A_1B_1$ of the leading coefficients.

Most references, when describing subquadratic multiplication algorithms,
only describe Karatsuba and FFT-based algorithms. Nevertheless, the
Toom–Cook algorithm is quite interesting in practice.

Toom–Cook $r$-way reduces one $n$-word product to $2r - 1$ products of about
$n/r$ words, thus costs $O(n^\nu)$ with $\nu = \log(2r - 1)/\log r$. However, the
constant hidden by the big-O notation depends strongly on the evaluation and
interpolation formulae, which in turn depend on the chosen points. One
possibility is to take $-(r - 1), \ldots, -1, 0, 1, \ldots, (r - 1)$ as evaluation points.
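
To get a feeling for the exponent $\nu$, it can be tabulated for small $r$ with a few lines of Python (the function name is ours). The value for $r = 2$ is $\lg 3 \approx 1.585$, the exponent of Karatsuba's algorithm.

```python
from math import log


def toom_exponent(r):
    """Exponent nu = log(2r - 1) / log(r) of Toom-Cook r-way."""
    return log(2 * r - 1) / log(r)


for r in range(2, 6):
    print(r, round(toom_exponent(r), 4))
```

The exponent decreases slowly towards 1 as $r$ grows, while the constant hidden in the big-O grows, which is why only small values of $r$ are used in practice.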

The case $r = 2$ corresponds to Karatsuba's algorithm (§1.3.2). The case
$r = 3$ is known as Toom–Cook 3-way, sometimes simply called “the Toom–Cook
algorithm”. Algorithm ToomCook3 uses the evaluation points $0, 1, -1, 2, \infty$,
and tries to optimize the evaluation and interpolation formulae.


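Algorithm 1.4 ToomCook3
Input: two integers $0 \le A, B < \beta^n$
Output: $AB := c_0 + c_1\beta^k + c_2\beta^{2k} + c_3\beta^{3k} + c_4\beta^{4k}$ with $k = \lceil n/3 \rceil$
Require: a threshold $n_1 \ge 3$
1: if $n < n_1$ then return KaratsubaMultiply($A, B$)
2: write $A = a_0 + a_1x + a_2x^2$, $B = b_0 + b_1x + b_2x^2$ with $x = \beta^k$.
3: $v_0 \leftarrow$ ToomCook3($a_0, b_0$)
4: $v_1 \leftarrow$ ToomCook3($a_{02} + a_1, b_{02} + b_1$) where $a_{02} \leftarrow a_0 + a_2$, $b_{02} \leftarrow b_0 + b_2$
5: $v_{-1} \leftarrow$ ToomCook3($a_{02} - a_1, b_{02} - b_1$)
6: $v_2 \leftarrow$ ToomCook3($a_0 + 2a_1 + 4a_2, b_0 + 2b_1 + 4b_2$)
7: $v_\infty \leftarrow$ ToomCook3($a_2, b_2$)
8: $t_1 \leftarrow (3v_0 + 2v_{-1} + v_2)/6 - 2v_\infty$, $t_2 \leftarrow (v_1 + v_{-1})/2$
9: $c_0 \leftarrow v_0$, $c_1 \leftarrow v_1 - t_1$, $c_2 \leftarrow t_2 - v_0 - v_\infty$, $c_3 \leftarrow t_1 - t_2$, $c_4 \leftarrow v_\infty$.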
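
The evaluation and interpolation steps of ToomCook3 can be checked with a Python sketch on built-in integers. As assumptions of ours, we take $\beta = 2$ (sizes in bits), use Python's multiplication below the threshold in place of KaratsubaMultiply, and handle the possibly negative operands of the evaluation at $-1$ by factoring out signs.

```python
def toom_cook3(a, b, n1_bits=64):
    """Toom-Cook 3-way on Python ints with evaluation points 0, 1, -1, 2, infinity."""
    if a < 0 or b < 0:                       # arises only from the evaluation at -1
        s = -1 if (a < 0) != (b < 0) else 1
        return s * toom_cook3(abs(a), abs(b), n1_bits)
    n = max(a.bit_length(), b.bit_length())
    if n < n1_bits:
        return a * b                         # stands in for KaratsubaMultiply
    k = -(-n // 3)                           # ceil(n / 3)
    mask = (1 << k) - 1
    a0, a1, a2 = a & mask, (a >> k) & mask, a >> (2 * k)
    b0, b1, b2 = b & mask, (b >> k) & mask, b >> (2 * k)
    a02, b02 = a0 + a2, b0 + b2
    v0 = toom_cook3(a0, b0, n1_bits)
    v1 = toom_cook3(a02 + a1, b02 + b1, n1_bits)
    vm1 = toom_cook3(a02 - a1, b02 - b1, n1_bits)
    v2 = toom_cook3(a0 + 2 * a1 + 4 * a2, b0 + 2 * b1 + 4 * b2, n1_bits)
    vinf = toom_cook3(a2, b2, n1_bits)
    t1 = (3 * v0 + 2 * vm1 + v2) // 6 - 2 * vinf   # exact division by 6
    t2 = (v1 + vm1) // 2                           # exact division by 2
    c0, c1, c2, c3, c4 = v0, v1 - t1, t2 - v0 - vinf, t1 - t2, vinf
    return c0 + (c1 << k) + (c2 << (2 * k)) + (c3 << (3 * k)) + (c4 << (4 * k))
```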


The divisions at step 8 are exact; if $\beta$ is a power of two, the division by 6
can be done using a division by 2 (which consists of a single shift) followed
by a division by 3 (see §1.4.7).

Toom–Cook $r$-way has to invert a $(2r - 1) \times (2r - 1)$ Vandermonde matrix
with parameters the evaluation points; if we choose consecutive integer points,
the determinant of that matrix contains all primes up to $2r - 2$. This proves
that division by (a multiple of) 3 cannot be avoided for Toom–Cook 3-way
with consecutive integer points. See Exercise 1.14 for a generalization of this
result.
1.3.4 Use of the fast Fourier transform (FFT)


Most subquadratic multiplication algorithms can be seen as
evaluation-interpolation algorithms. They mainly differ in the number of
evaluation points, and the values of those points. However, the evaluation and
interpolation formulae become intricate in Toom–Cook $r$-way for large $r$, since
they involve $O(r^2)$ scalar operations. The fast Fourier transform (FFT) is a way
to perform evaluation and interpolation efficiently for some special points (roots
of unity) and special values of $r$. This explains why multiplication algorithms
with the best known asymptotic complexity are based on the FFT.

There are different flavours of FFT multiplication, depending on the ring
where the operations are performed. The Schönhage–Strassen algorithm, with
a complexity of $O(n \log n \log\log n)$, works in the ring $\mathbb{Z}/(2^n + 1)\mathbb{Z}$. Since it
is based on modular computations, we describe it in Chapter 2.

Other commonly used algorithms work with floating-point complex numbers.
A drawback is that, due to the inexact nature of floating-point computations,
a careful error analysis is required to guarantee the correctness of the
implementation, assuming an underlying arithmetic with rigorous error bounds.
See Theorem 3.6 in Chapter 3.
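
A floating-point FFT multiplication can be sketched in pure Python as follows. This is only an illustration with names of our choosing: it uses a textbook recursive radix-2 transform, a small base of $2^8$, and simply rounds the results to the nearest integer, with no rigorous error bound. It is therefore trustworthy only for modest sizes, which is exactly the drawback mentioned above.

```python
import cmath


def fft(v, sign=-1):
    """Recursive radix-2 transform; len(v) must be a power of two."""
    n = len(v)
    if n == 1:
        return [complex(v[0])]
    even, odd = fft(v[0::2], sign), fft(v[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + w, even[k] - w
    return out


def to_digits(a, base):
    digits = []
    while a > 0:
        a, r = divmod(a, base)
        digits.append(r)
    return digits


def fft_multiply(a, b, base=256):
    """Multiply non-negative ints via a complex FFT (no proven error bound)."""
    da, db = to_digits(a, base), to_digits(b, base)
    size = 1
    while size < len(da) + len(db):       # room for the full (acyclic) product
        size *= 2
    fa = fft(da + [0] * (size - len(da)))
    fb = fft(db + [0] * (size - len(db)))
    prod = fft([x * y for x, y in zip(fa, fb)], sign=+1)   # backward transform
    coeffs = [round(z.real / size) for z in prod]          # rounding absorbs the float error
    return sum(c * base**i for i, c in enumerate(coeffs))
```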

We say that multiplication is in the FFT range if $n$ is large and the
multiplication algorithm satisfies $M(2n) \sim 2M(n)$. For example, this is true if the
Schönhage–Strassen multiplication algorithm is used, but not if the classical
algorithm or Karatsuba's algorithm is used.


1.3.5 Unbalanced multiplication 


The subquadratic algorithms considered so far (Karatsuba and Toom–Cook)
work with equal-size operands. How do we efficiently multiply integers of
different sizes with a subquadratic algorithm? This case is important in practice,
but is rarely considered in the literature. Assume the larger operand has size
$m$, and the smaller has size $n \le m$, and denote by $M(m, n)$ the corresponding
multiplication cost.

If evaluation-interpolation algorithms are used, the cost depends mainly on
the size of the result, i.e. $m + n$, so we have $M(m, n) \le M((m + n)/2)$, at
least approximately. We can do better than $M((m + n)/2)$ if $n$ is much smaller
than $m$, for example $M(m, 1) = O(m)$.

When $m$ is an exact multiple of $n$, say $m = kn$, a trivial strategy is to cut the
larger operand into $k$ pieces, giving $M(kn, n) = kM(n) + O(kn)$. However,
this is not always the best strategy, see Exercise 1.16.
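
The trivial strategy can be written down in a few lines of Python. The function name `multiply_unbalanced` and the `mul` parameter (any balanced multiplication routine) are hypothetical; sizes are counted in bits for simplicity.

```python
def multiply_unbalanced(a, b, n_bits, mul=lambda x, y: x * y):
    """Multiply a by b, where b has at most n_bits bits, cutting a into n_bits-bit pieces."""
    mask = (1 << n_bits) - 1
    result, shift = 0, 0
    while a > 0:
        result += mul(a & mask, b) << shift   # one balanced n-by-n product per piece
        a >>= n_bits
        shift += n_bits
    return result
```

Each piece costs one balanced product, and the additions of the overlapping partial products account for the $O(kn)$ term.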
When $m$ is not an exact multiple of $n$, several strategies are possible:

• split the two operands into an equal number of pieces of unequal sizes;
• or split the two operands into different numbers of pieces.


Each strategy has advantages and disadvantages. We discuss each in turn. 


First strategy: equal number of pieces of unequal sizes
Consider for example Karatsuba multiplication, and let $K(m, n)$ be the number
of word-products for an $m \times n$ product. Take for example $m = 5$, $n = 3$.
A natural idea is to pad the smaller operand to the size of the larger one.
However, there are several ways to perform this padding, as shown in the
following figure, where the “Karatsuba cut” is represented by a double column:


    a4  a3 || a2  a1  a0        a4  a3 || a2  a1  a0        a4  a3 || a2  a1  a0
           || b2  b1  b0            b2 || b1  b0            b2  b1 || b0
        $A \times B$                $A \times (\beta B)$             $A \times (\beta^2 B)$


The left variant leads to two products of size 3, i.e. $2K(3, 3)$, the middle one to
$K(2, 1) + K(3, 2) + K(3, 3)$, and the right one to $K(2, 2) + K(3, 1) + K(3, 3)$,
which give respectively 14, 15, 13 word-products.

However, whenever $m/2 \le n \le m$, any such “padding variant” will require
$K(\lceil m/2 \rceil, \lceil m/2 \rceil)$ for the product of the differences (or sums) of the
low and high parts from the operands, due to a “wrap-around” effect when
subtracting the parts from the smaller operand; this will ultimately lead to a
cost similar to that of an $m \times m$ product. The “odd–even scheme” of Algorithm
OddEvenKaratsuba (see also Exercise 1.13) avoids this wrap-around. Here is
an example of this algorithm for $m = 3$ and $n = 2$. Take $A = a_2x^2 + a_1x + a_0$
and $B = b_1x + b_0$. This yields $A_0 = a_2x + a_0$, $A_1 = a_1$, $B_0 = b_0$, $B_1 = b_1$;
thus, $C_0 = (a_2x + a_0)b_0$, $C_1 = (a_2x + a_0 + a_1)(b_0 + b_1)$, $C_2 = a_1b_1$.


Algorithm 1.5 OddEvenKaratsuba
Input: $A = \sum_{i=0}^{m-1} a_i x^i$, $B = \sum_{j=0}^{n-1} b_j x^j$, $m \ge n \ge 1$
Output: $A \cdot B$
if $n = 1$ then return $\sum_{i=0}^{m-1} a_i b_0 x^i$
write $A = A_0(x^2) + xA_1(x^2)$, $B = B_0(x^2) + xB_1(x^2)$
$C_0 \leftarrow$ OddEvenKaratsuba($A_0, B_0$)
$C_1 \leftarrow$ OddEvenKaratsuba($A_0 + A_1, B_0 + B_1$)
$C_2 \leftarrow$ OddEvenKaratsuba($A_1, B_1$)
return $C_0(x^2) + x(C_1 - C_0 - C_2)(x^2) + x^2C_2(x^2)$.
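
For polynomials, Algorithm OddEvenKaratsuba translates naturally to Python slices: `a[0::2]` and `a[1::2]` give the even and odd parts. The sketch below uses coefficient lists (lowest degree first) and helper names of our own; it is meant to illustrate the recursion, not an efficient implementation.

```python
def poly_add(p, q):
    """Add two coefficient lists of possibly different lengths."""
    if len(p) < len(q):
        p, q = q, p
    return [c + (q[i] if i < len(q) else 0) for i, c in enumerate(p)]


def odd_even_karatsuba(a, b):
    """Product of polynomials a and b (non-empty coefficient lists, lowest degree first)."""
    if len(a) < len(b):
        a, b = b, a                       # ensure m >= n
    if len(b) == 1:
        return [c * b[0] for c in a]
    a0, a1 = a[0::2], a[1::2]             # A = A0(x^2) + x A1(x^2)
    b0, b1 = b[0::2], b[1::2]
    c0 = odd_even_karatsuba(a0, b0)
    c1 = odd_even_karatsuba(poly_add(a0, a1), poly_add(b0, b1))
    c2 = odd_even_karatsuba(a1, b1)
    result = [0] * (len(a) + len(b) + 1)  # two spare slots, trimmed at the end
    for i, c in enumerate(c0):
        result[2 * i] += c                # C0(x^2)
    for i, c in enumerate(c2):
        result[2 * i + 2] += c            # x^2 C2(x^2)
    for i, c in enumerate(c1):            # x (C1 - C0 - C2)(x^2)
        mid = c - (c0[i] if i < len(c0) else 0) - (c2[i] if i < len(c2) else 0)
        result[2 * i + 1] += mid
    return result[: len(a) + len(b) - 1]
```

For $m = 3$ and $n = 2$, the call performs two products of sizes $2 \times 1$ and one product of size $1 \times 1$.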
We therefore get $K(3, 2) = 2K(2, 1) + K(1) = 5$ with the odd–even
scheme. The general recurrence for the odd–even scheme is

$$K(m, n) = 2K(\lceil m/2 \rceil, \lceil n/2 \rceil) + K(\lfloor m/2 \rfloor, \lfloor n/2 \rfloor),$$

instead of

$$K(m, n) = 2K(\lceil m/2 \rceil, \lceil m/2 \rceil) + K(\lfloor m/2 \rfloor, n - \lceil m/2 \rceil)$$

for the classical variant, assuming $n > m/2$. We see that the second parameter
in $K(\cdot, \cdot)$ only depends on the smaller size $n$ for the odd–even scheme.
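
The odd–even recurrence is easily evaluated with a memoized Python function (the name `K_odd_even` is ours). We take $K(m, 1) = m$ as the base case, which matches the first step of Algorithm OddEvenKaratsuba.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def K_odd_even(m, n):
    """Number of word-products of the odd-even scheme for an m x n product."""
    if m < n:
        m, n = n, m
    if n == 0:
        return 0
    if n == 1:
        return m                          # m products by a single word
    return 2 * K_odd_even((m + 1) // 2, (n + 1) // 2) + K_odd_even(m // 2, n // 2)
```

This gives $K(3, 2) = 5$ as above, and $K(5, 3) = 2K(3, 2) + K(2, 1) = 12$.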

As for the classical variant, there are several ways of padding with the
odd–even scheme. Consider $m = 5$, $n = 3$, and write
$A := a_4x^4 + a_3x^3 + a_2x^2 + a_1x + a_0 = xA_1(x^2) + A_0(x^2)$, with
$A_1(x) = a_3x + a_1$, $A_0(x) = a_4x^2 + a_2x + a_0$; and
$B := b_2x^2 + b_1x + b_0 = xB_1(x^2) + B_0(x^2)$, with $B_1(x) = b_1$,
$B_0(x) = b_2x + b_0$. Without padding, we write
$AB = x^2(A_1B_1)(x^2) + x((A_0 + A_1)(B_0 + B_1) - A_1B_1 - A_0B_0)(x^2) + (A_0B_0)(x^2)$,
which gives $K(5, 3) = K(2, 1) + 2K(3, 2) = 12$. With padding, we consider
$xB = xB_1'(x^2) + B_0'(x^2)$, with $B_1'(x) = b_2x + b_0$, $B_0' = b_1x$. This gives
$K(2, 2) = 3$ for $A_1B_1'$, $K(3, 2) = 5$ for $(A_0 + A_1)(B_0' + B_1')$, and
$K(3, 1) = 3$ for $A_0B_0'$ (taking into account the fact that $B_0'$ has only one
non-zero coefficient), thus a total of 11 only.


Note that when the variable $x$ corresponds to say $\beta = 2^{64}$, Algorithm
OddEvenKaratsuba as presented above is not very practical in the integer
case, because of a problem with carries. For example, in the sum $A_0 + A_1$ we
have $\lfloor m/2 \rfloor$ carries to store. A workaround is to consider $x$ to be say $\beta^{10}$, in
which case we have to store only one carry bit for ten words, instead of one
carry bit per word.

The first strategy, which consists in cutting the operands into an equal
number of pieces of unequal sizes, does not scale up nicely. Assume for example
that we want to multiply a number of 999 words by another number of 699
words, using Toom–Cook 3-way. With the classical variant (without padding)
and a “large” base of $\beta^{333}$, we cut the larger operand into three pieces of 333
words and the smaller one into two pieces of 333 words and one small piece of
33 words. This gives four full $333 \times 333$ products, ignoring carries, and one
unbalanced $333 \times 33$ product (for the evaluation at $x = \infty$). The “odd–even”
variant cuts the larger operand into three pieces of 333 words, and the smaller
operand into three pieces of 233 words, giving rise to five equally unbalanced
$333 \times 233$ products, again ignoring carries.




Second strategy: different number of pieces of equal sizes 


Instead of splitting unbalanced operands into an equal number of pieces — 
which are then necessarily of different sizes — an alternative strategy is to split 
the operands into a different number of pieces, and use a multiplication al- 
gorithm which is naturally unbalanced. Consider again the example of multi- 
plying two numbers of 999 and 699 words. Assume we have a multiplication 
algorithm, say Toom-(3, 2), which multiplies a number of 3n words by another 
number of 2n words; this requires four products of numbers of about n words. 
Using n = 350, we can split the larger number into two pieces of 350 words, 
and one piece of 299 words, and the smaller number into one piece of 350 
words and one piece of 349 words. 

Similarly, for two inputs of 1000 and 500 words, we can use a Toom-(4, 2) 
algorithm, which multiplies two numbers of 4n and 2n words, with n = 250. 
Such an algorithm requires five evaluation points; if we choose the same points 
as for Toom 3-way, then the interpolation phase can be shared between both 
implementations. 

It seems that this second strategy is not compatible with the “odd–even”
variant, which requires that both operands are cut into the same number of
pieces. Consider for example the “odd–even” variant modulo 3. It writes the
numbers to be multiplied as $A = a(\beta)$ and $B = b(\beta)$ with
$a(t) = a_0(t^3) + ta_1(t^3) + t^2a_2(t^3)$, and similarly
$b(t) = b_0(t^3) + tb_1(t^3) + t^2b_2(t^3)$. We see that the number of pieces of each
operand is the chosen modulus, here 3 (see Exercise 1.11). Experimental results
comparing different multiplication algorithms are illustrated in Figure 1.1.


Asymptotic complexity of unbalanced multiplication 
Suppose $m \ge n$ and $n$ is large. To use an evaluation-interpolation scheme,
we need to evaluate the product at $m + n$ points, whereas balanced $k$ by $k$
multiplication needs $2k$ points. Taking $k \approx (m+n)/2$, we see that
$M(m,n) \le M((m+n)/2)(1 + o(1))$ as $n \to \infty$. On the other hand, from the
discussion above, we have $M(m,n) \le \lceil m/n \rceil M(n)$. This explains the upper
bound on $M(m,n)$ given in the Summary of complexities at the end of the book.


1.3.6 Squaring 


In many applications, a significant proportion of the multiplications have equal
operands, i.e. are squarings. Hence, it is worth tuning a special squaring
implementation as much as the implementation of multiplication itself, bearing
in mind that the best possible speedup is two (see Exercise 1.17).
y \ x   4  11  18  25  32  39  46  53  60  67  74  81  88  95 102 109 116 123 130 137 144 151 158
   11  bc  bc
   18  bc  bc  22
   25  bc  bc  bc  22
   32  bc  bc  bc  bc  22
   39  bc  bc  bc  32  32  33
   46  bc  bc  bc  32  32  32  22
   53  bc  bc  bc  bc  32  32  32  22
   60  bc  bc  bc  bc  32  32  32  32  22
   67  bc  bc  bc  bc  42  32  32  32  33  33
   74  bc  bc  bc  bc  42  32  32  32  32  33  33
   81  bc  bc  bc  bc  32  32  32  32  32  33  33  33
   88  bc  bc  bc  bc  32  42  42  32  32  32  33  33  33
   95  bc  bc  bc  bc  42  42  42  32  32  32  33  33  33  22
  102  bc  bc  bc  bc  42  42  42  42  32  32  32  33  33  44  33
  109  bc  bc  bc  bc  bc  42  42  42  42  32  32  32  33  32  44  44
  116  bc  bc  bc  bc  bc  42  42  42  42  32  32  32  32  32  44  44  44
  123  bc  bc  bc  bc  bc  42  42  42  42  42  32  32  32  32  44  44  44  44
  130  bc  bc  bc  bc  bc  42  42  42  42  42  42  32  32  32  44  44  44  44  44
  137  bc  bc  bc  bc  bc  42  42  42  42  42  42  32  32  32  33  33  44  33  33  33
  144  bc  bc  bc  bc  bc  42  42  42  42  42  42  32  32  32  32  32  33  44  33  33  33
  151  bc  bc  bc  bc  bc  42  42  42  42  42  42  42  32  32  32  32  33  33  33  33  33  33
  158  bc  bc  bc  bc  bc  bc  42  42  42  42  42  42  32  32  32  32  32  33  33  33  33  33  33


Figure 1.1 The best algorithm to multiply two numbers of x and y words
for $4 \le x \le y \le 158$: bc is schoolbook multiplication, 22 is Karatsuba's
algorithm, 33 is Toom-3, 32 is Toom-(3, 2), 44 is Toom-4, and 42 is
Toom-(4, 2). This graph was obtained on a Core 2, with GMP 5.0.0, and GCC 4.4.2.
Note that for $x \le (y + 3)/4$, only the schoolbook multiplication is
available; since we did not consider the algorithm that cuts the larger operand
into several pieces, this explains why bc is best for say x = 32 and y = 158.


For naive multiplication, Algorithm 1.2 BasecaseMultiply can be modified
to obtain a theoretical speedup of two, since only about half of the products
$a_i b_j$ need to be computed.
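The saving can be illustrated with a minimal Python sketch. It is not the book's Algorithm 1.2: the function names `basecase_square` and `to_int`, and the little-endian word list, are assumptions made for illustration. Each off-diagonal product $a_i a_j$ with $i < j$ is computed once and doubled, so about half of the $n^2$ word products are formed.

```python
def basecase_square(a, beta):
    """Square a number given as little-endian words a[i], 0 <= a[i] < beta."""
    n = len(a)
    acc = [0] * (2 * n)
    # Off-diagonal products a[i]*a[j], i < j, are computed once and doubled.
    for i in range(n):
        for j in range(i + 1, n):
            acc[i + j] += 2 * a[i] * a[j]
    # The n diagonal products a[i]^2 are added separately.
    for i in range(n):
        acc[2 * i] += a[i] * a[i]
    # Propagate carries so that every word is again in [0, beta).
    carry = 0
    for k in range(2 * n):
        carry, acc[k] = divmod(acc[k] + carry, beta)
    return acc


def to_int(words, beta):
    """Convert little-endian words back to a Python integer."""
    return sum(w * beta**i for i, w in enumerate(words))
```

For instance, `to_int(basecase_square([844, 443, 842], 1000), 1000)` should agree with `842443844**2`.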

Subquadratic algorithms like Karatsuba and Toom—Cook r-way can be spe- 
cialized for squaring too. In general, the threshold obtained is larger than the 
corresponding multiplication threshold. For example, on a modern 64-bit com- 
puter, we can expect a threshold between the naive quadratic squaring and 
Karatsuba’s algorithm in the 30-word range, between Karatsuba and Toom— 
Cook 3-way in the 100-word range, between Toom—Cook 3-way and Toom— 
Cook 4-way in the 150-word range, and between Toom—Cook 4-way and the 
FFT in the 2500-word range. 

The classical approach for fast squaring is to take a fast multiplication
algorithm, say Toom-Cook r-way, and to replace the $2r - 1$ recursive products by
$2r - 1$ recursive squarings. For example, starting from Algorithm ToomCook3,
we obtain five recursive squarings $a_0^2$, $(a_0 + a_1 + a_2)^2$, $(a_0 - a_1 + a_2)^2$,
$(a_0 + 2a_1 + 4a_2)^2$, and $a_2^2$. A different approach, called asymmetric squaring,
is to allow products that are not squares in the recursive calls. For example,
the square of $a_2\beta^2 + a_1\beta + a_0$ is $c_4\beta^4 + c_3\beta^3 + c_2\beta^2 + c_1\beta + c_0$, where
$c_4 = a_2^2$, $c_3 = 2a_1a_2$, $c_2 = c_0 + c_4 - s$, $c_1 = 2a_1a_0$, and $c_0 = a_0^2$, where
$s = (a_0 - a_2 + a_1)(a_0 - a_2 - a_1)$. This formula performs two squarings,
and three normal products. Such asymmetric squaring formulae are not
asymptotically optimal, but might be faster in some medium range, due to simpler
evaluation or interpolation phases.

Figure 1.2 Ratio of the squaring and multiplication time (mpn_sqr versus
mpn_mul_n) for the GNU MP library, version 5.0.0, on a Core 2 processor, up to
one million words.
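As a quick check of the asymmetric formula, the following Python sketch evaluates the five coefficients with two squarings and three ordinary products. The function name `asymmetric_square3` is hypothetical, and Python integers stand in for the recursive calls.

```python
def asymmetric_square3(a0, a1, a2, beta):
    """Square a2*beta^2 + a1*beta + a0 with 2 squarings and 3 products."""
    c4 = a2 * a2                           # squaring
    c0 = a0 * a0                           # squaring
    c3 = 2 * (a1 * a2)                     # product
    c1 = 2 * (a1 * a0)                     # product
    s = (a0 - a2 + a1) * (a0 - a2 - a1)    # product, equals (a0-a2)^2 - a1^2
    c2 = c0 + c4 - s                       # equals 2*a0*a2 + a1^2
    return c4 * beta**4 + c3 * beta**3 + c2 * beta**2 + c1 * beta + c0
```

The identity $c_2 = 2a_0a_2 + a_1^2$ follows by expanding $s = (a_0 - a_2)^2 - a_1^2$.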

Figure 1.2 compares the multiplication and squaring time with the GNU MP 
library. It shows that whatever the word range, a good rule of thumb is to count 
2/3 of the cost of a product for a squaring. 


1.3.7 Multiplication by a constant 


It often happens that the same multiplier is used in several consecutive
operations, or even for a complete calculation. If this constant multiplier is
small, i.e. less than the base $\beta$, not much speedup can be obtained compared to
the usual product. We thus consider here a "large" constant multiplier.

When using evaluation-interpolation algorithms, such as Karatsuba or Toom-Cook
(see §1.3.2 and §1.3.3), we may store the evaluations for that fixed multiplier
at the different points chosen.

Special-purpose algorithms also exist. These algorithms differ from classical
multiplication algorithms because they take into account the value of the
given constant multiplier, and not only its size in bits or digits. They also
differ in the model of complexity used. For example, R. Bernstein's algorithm
[27], which is used by several compilers to compute addresses in data
structure records, considers as basic operation $x, y \mapsto 2^i x + y$, with a cost
assumed to be independent of the integer $i$.

For example, Bernstein's algorithm computes $20061x$ in five steps:

    $x_1 := 31x    = 2^5 x - x$
    $x_2 := 93x    = 2^1 x_1 + x_1$
    $x_3 := 743x   = 2^3 x_2 - x$
    $x_4 := 6687x  = 2^3 x_3 + x_3$
    $20061x        = 2^1 x_4 + x_4$.
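The five steps can be replayed with shifts and additions. This Python sketch (the function name `times_20061` is an assumption) only checks the chain above; it does not search for such chains as Bernstein's algorithm does.

```python
def times_20061(x):
    """Multiply x by 20061 using five shift-and-add/subtract steps."""
    x1 = (x << 5) - x       # 31x    = 2^5 x  - x
    x2 = (x1 << 1) + x1     # 93x    = 2^1 x1 + x1
    x3 = (x2 << 3) - x      # 743x   = 2^3 x2 - x
    x4 = (x3 << 3) + x3     # 6687x  = 2^3 x3 + x3
    return (x4 << 1) + x4   # 20061x = 2^1 x4 + x4
```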


1.4 Division 


Division is the next operation to consider after multiplication. Optimizing di- 
vision is almost as important as optimizing multiplication, since division is 
usually more expensive, thus the speedup obtained on division will be more 
significant. On the other hand, we usually perform more multiplications than 
divisions. 

One strategy is to avoid divisions when possible, or replace them by
multiplications. An example is when the same divisor is used for several
consecutive operations; we can then precompute its inverse (see §2.4.1).

We distinguish several kinds of division: full division computes both
quotient and remainder, while in other cases only the quotient (for example,
when dividing two floating-point significands) or remainder (when multiplying
two residues modulo $n$) is needed. We also discuss exact division, when the
remainder is known to be zero, and the problem of dividing by a single word.


1.4.1 Naive division 


In all division algorithms, we assume that divisors are normalized. We say that
$B = \sum_{j=0}^{n-1} b_j \beta^j$ is normalized when its most significant word $b_{n-1}$ satisfies
$b_{n-1} \ge \beta/2$. This is a stricter condition (for $\beta > 2$) than simply requiring that
$b_{n-1}$ be non-zero.

If $B$ is not normalized, we can compute $A' = 2^k A$ and $B' = 2^k B$ so
that $B'$ is normalized, then divide $A'$ by $B'$ giving $A' = Q'B' + R'$. The
quotient and remainder of the division of $A$ by $B$ are, respectively, $Q := Q'$
and $R := R'/2^k$; the latter division being exact.


Algorithm 1.6 BasecaseDivRem
Input: $A = \sum_{i=0}^{n+m-1} a_i \beta^i$, $B = \sum_{j=0}^{n-1} b_j \beta^j$, $B$ normalized, $m \ge 0$
Output: quotient $Q$ and remainder $R$ of $A$ divided by $B$
1: if $A \ge \beta^m B$ then $q_m \leftarrow 1$, $A \leftarrow A - \beta^m B$ else $q_m \leftarrow 0$
2: for $j$ from $m - 1$ downto 0 do
3:    $q_j^* \leftarrow \lfloor (a_{n+j}\beta + a_{n+j-1})/b_{n-1} \rfloor$      ▷ quotient selection step
4:    $q_j \leftarrow \min(q_j^*, \beta - 1)$
5:    $A \leftarrow A - q_j \beta^j B$
6:    while $A < 0$ do
7:       $q_j \leftarrow q_j - 1$
8:       $A \leftarrow A + \beta^j B$
9: return $Q = \sum_{j=0}^{m} q_j \beta^j$, $R = A$.
(Note: in step 3, $a_i$ denotes the current value of the $i$th word of $A$, which may
be modified at steps 5 and 8.)


Theorem 1.3 Algorithm BasecaseDivRem correctly computes the quotient
and remainder of the division of $A$ by a normalized $B$, in $O(n(m + 1))$ word
operations.


Proof. We prove that the invariant $A < \beta^{j+1} B$ holds at step 2. This holds
trivially for $j = m - 1$: $B$ being normalized, $A < 2\beta^m B$ initially.

First consider the case $q_j = q_j^*$. Then $q_j b_{n-1} \ge a_{n+j}\beta + a_{n+j-1} - b_{n-1} + 1$,
and therefore

    $A - q_j \beta^j B \le (b_{n-1} - 1)\beta^{n+j-1} + (A \bmod \beta^{n+j-1})$,

which ensures that the new $a_{n+j}$ vanishes, and $a_{n+j-1} < b_{n-1}$; thus,
$A < \beta^j B$ after step 5. Now $A$ may become negative after step 5, but, since
$q_j b_{n-1} \le a_{n+j}\beta + a_{n+j-1}$, we have

    $A - q_j \beta^j B > (a_{n+j}\beta + a_{n+j-1})\beta^{n+j-1} - q_j (b_{n-1}\beta^{n-1} + \beta^{n-1})\beta^j \ge -q_j \beta^{n+j-1}$.

Therefore, $A - q_j \beta^j B + 2\beta^j B \ge (2b_{n-1} - q_j)\beta^{n+j-1} > 0$, which proves that
the while-loop at steps 6-8 is performed at most twice [142, Theorem 4.3.1.B].
When the while-loop is entered, $A$ may increase only by $\beta^j B$ at a time; hence,
$A < \beta^j B$ at exit.

In the case $q_j \ne q_j^*$, i.e. $q_j^* \ge \beta$, we have before the while-loop
$A < \beta^{j+1} B - (\beta - 1)\beta^j B = \beta^j B$; thus, the invariant holds. If the
while-loop is entered, the same reasoning as above holds.

We conclude that when the for-loop ends, $0 \le A < B$ holds, and, since
$(\sum_{j}^{m} q_j \beta^j) B + A$ is invariant throughout the algorithm, the quotient $Q$ and
remainder $R$ are correct.

The most expensive part is step 5, which costs $O(n)$ operations for $q_j B$ (the
multiplication by $\beta^j$ is simply a word-shift); the total cost is $O(n(m + 1))$.
(For $m = 0$, we need $O(n)$ work if $A \ge B$, and even if $A < B$ to compare the
inputs in the case $A = B - 1$.)


Here is an example of algorithm BasecaseDivRem for the inputs
A = 766 970 544 842 443 844 and B = 862 664 913, with $\beta = 1000$, which
gives quotient Q = 889 071 217 and remainder R = 778 334 723.


j   A                          q_j   A - q_j B β^j          after correction

2   766 970 544 842 443 844    889   61 437 185 443 844     no change
1   61 437 185 443 844         071   187 976 620 844        no change
0   187 976 620 844            218   -84 330 190            778 334 723

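The steps of Algorithm BasecaseDivRem can be sketched in Python as follows. This is an illustration under stated assumptions, not a word-level implementation: $A$ and $B$ are ordinary Python integers, the sizes $n$ and $m$ and the base `beta` are passed explicitly, and the function name `basecase_divrem` is hypothetical.

```python
def basecase_divrem(A, B, n, m, beta):
    """Divide A (< beta^(n+m)) by a normalized n-word B, one quotient word at a time."""
    b_top = B // beta**(n - 1)               # most significant word b_{n-1}
    assert 2 * b_top >= beta, "B must be normalized"
    q = [0] * (m + 1)
    if A >= beta**m * B:                     # step 1
        q[m], A = 1, A - beta**m * B
    for j in range(m - 1, -1, -1):           # step 2
        top2 = A // beta**(n + j - 1)        # a_{n+j}*beta + a_{n+j-1}
        qj = min(top2 // b_top, beta - 1)    # steps 3 and 4
        A -= qj * beta**j * B                # step 5
        while A < 0:                         # steps 6 to 8 (at most twice)
            qj -= 1
            A += beta**j * B
        q[j] = qj
    Q = sum(qj * beta**j for j, qj in enumerate(q))
    return Q, A
```

Called as `basecase_divrem(766970544842443844, 862664913, 3, 3, 1000)`, it goes through the same partial quotients 889, 071 and 218 (corrected to 217) as the table above.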


Algorithm BasecaseDivRem simplifies when $A < \beta^m B$: remove step 1,
and change $m$ into $m - 1$ in the return value $Q$. However, the more general
form we give is more convenient for a computer implementation, and will be
used below.

A possible variant when $q_j^* \ge \beta$ is to let $q_j = \beta$; then $A - q_j \beta^j B$ at step 5
reduces to a single subtraction of $B$ shifted by $j + 1$ words. However, in this
case the while-loop will be performed at least once, which corresponds to the
identity $A - (\beta - 1)\beta^j B = A - \beta^{j+1} B + \beta^j B$.

If instead of having $B$ normalized, i.e. $b_{n-1} \ge \beta/2$, we have $b_{n-1} \ge \beta/k$, there
can be up to $k$ iterations of the while-loop (and step 1 has to be modified).

A drawback of Algorithm BasecaseDivRem is that the test $A < 0$ at line 6
is true with non-negligible probability; therefore, branch prediction algorithms
available on modern processors will fail, resulting in wasted cycles. A
workaround is to compute a more accurate partial quotient, in order to decrease
the proportion of corrections to almost zero (see Exercise 1.20).


1.4.2 Divisor preconditioning


Sometimes the quotient selection (step 3 of Algorithm BasecaseDivRem) is
quite expensive compared to the total cost, especially for small sizes. Indeed,
some processors do not have a machine instruction for the division of two
words by one word; one way to compute $q_j^*$ is then to precompute a one-word
approximation of the inverse of $b_{n-1}$, and to multiply it by $a_{n+j}\beta + a_{n+j-1}$.

Svoboda's algorithm makes the quotient selection trivial, after preconditioning
the divisor. The main idea is that if $b_{n-1}$ equals the base $\beta$ in Algorithm
BasecaseDivRem, then the quotient selection is easy, since it suffices to take
$q_j^* = a_{n+j}$. (In addition, $q_j^* \le \beta - 1$ is then always fulfilled; thus, step 4 of
BasecaseDivRem can be avoided, and $q_j^*$ replaced by $q_j$.)


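Algorithm 1.7 SvobodaDivision
Input: $A = \sum_{i=0}^{n+m-1} a_i \beta^i$, $B = \sum_{j=0}^{n-1} b_j \beta^j$ normalized, $A < \beta^m B$, $m \ge 1$
Output: quotient $Q$ and remainder $R$ of $A$ divided by $B$
1: $k \leftarrow \lceil \beta^{n+1}/B \rceil$
2: $B' \leftarrow kB = \beta^{n+1} + \sum_{j=0}^{n-1} b'_j \beta^j$
3: for $j$ from $m - 1$ downto 1 do
4:    $q_j \leftarrow a_{n+j}$                          ▷ current value of $a_{n+j}$
5:    $A \leftarrow A - q_j \beta^{j-1} B'$
6:    if $A < 0$ then
7:       $q_j \leftarrow q_j - 1$
8:       $A \leftarrow A + \beta^{j-1} B'$
9: $Q' = \sum_{j=1}^{m-1} q_j \beta^{j-1}$, $R' = A$
10: $(q_0, R) \leftarrow (R' \text{ div } B, R' \bmod B)$      ▷ using BasecaseDivRem
11: return $Q = kQ' + q_0$, $R$.
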


With the example of §1.4.1, Svoboda's algorithm would give k = 1160,
B' = 1 000 691 299 080:


j   A                          q_j   A - q_j B' β^{j-1}      after correction
2   766 970 544 842 443 844    766   441 009 747 163 844     no change
1   441 009 747 163 844        441   -295 115 730 436        705 575 568 644


We thus get Q' = 766 440 and R' = 705 575 568 644. The final division of
step 10 gives R' = 817B + 778 334 723, and we get Q = 1160 · 766 440 +
817 = 889 071 217, and R = 778 334 723, as in §1.4.1.

Svoboda’s algorithm is especially interesting when only the remainder is 
needed, since then we can avoid the “deconditioning” Q = kQ’ + qo. Note 
that when only the quotient is needed, dividing A’ = kA by B’ = kB is 
another way to compute it. 
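A minimal Python sketch of Svoboda's preconditioned division follows. It makes several assumptions: integers replace word arrays, Python's `divmod` stands in for the final BasecaseDivRem call, the name `svoboda_divrem` is hypothetical, and $Q'$ is accumulated with weight $\beta^{j-1}$ for $q_j$, which is the weighting that matches $Q' = 766\,440$ in the example.

```python
def svoboda_divrem(A, B, n, m, beta):
    """Divide A (< beta^m * B) by the n-word normalized B after preconditioning."""
    k = -(-beta**(n + 1) // B)            # k = ceil(beta^(n+1) / B)
    Bp = k * B                            # B' = beta^(n+1) + (n low words)
    Qp = 0
    for j in range(m - 1, 0, -1):
        qj = A // beta**(n + j)           # quotient selection: current top word
        A -= qj * beta**(j - 1) * Bp
        if A < 0:                         # a single correction is enough
            qj -= 1
            A += beta**(j - 1) * Bp
        Qp += qj * beta**(j - 1)
    q0, R = divmod(A, B)                  # stands in for BasecaseDivRem
    return k * Qp + q0, R
```

On the inputs of §1.4.1 the loop selects 766 and then 441 (corrected to 440), as in the table above.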


1.4.3 Divide and conquer division


The base-case division of §1.4.1 determines the quotient word by word. A
natural idea is to try getting several words at a time, for example replacing the
quotient selection step in Algorithm BasecaseDivRem by

    $q_j^* \leftarrow \left\lfloor \dfrac{a_{n+j}\beta^3 + a_{n+j-1}\beta^2 + a_{n+j-2}\beta + a_{n+j-3}}{b_{n-1}\beta + b_{n-2}} \right\rfloor$.

Since $q_j^*$ has then two words, fast multiplication algorithms (§1.3) might speed
up the computation of $q_j B$ at step 5 of Algorithm BasecaseDivRem.

More generally, the most significant half of the quotient, say $Q_1$, of
$\ell = m - k$ words, mainly depends on the $\ell$ most significant words of the
dividend and divisor. Once a good approximation to $Q_1$ is known, fast
multiplication algorithms can be used to compute the partial remainder $A - Q_1 B \beta^k$.
The second idea of the divide and conquer algorithm RecursiveDivRem is to
compute the corresponding remainder together with the partial quotient $Q_1$; in
such a way, we only have to subtract the product of $Q_1$ by the low part of the
divisor, before computing the low part of the quotient.


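Algorithm 1.8 RecursiveDivRem
Input: $A = \sum_{i=0}^{n+m-1} a_i \beta^i$, $B = \sum_{j=0}^{n-1} b_j \beta^j$, $B$ normalized, $n \ge m$
Output: quotient $Q$ and remainder $R$ of $A$ divided by $B$
1: if $m < 2$ then return BasecaseDivRem($A$, $B$)
2: $k \leftarrow \lfloor m/2 \rfloor$, $B_1 \leftarrow B \text{ div } \beta^k$, $B_0 \leftarrow B \bmod \beta^k$
3: $(Q_1, R_1) \leftarrow$ RecursiveDivRem($A \text{ div } \beta^{2k}$, $B_1$)
4: $A' \leftarrow R_1 \beta^{2k} + (A \bmod \beta^{2k}) - Q_1 B_0 \beta^k$
5: while $A' < 0$ do $Q_1 \leftarrow Q_1 - 1$, $A' \leftarrow A' + \beta^k B$
6: $(Q_0, R_0) \leftarrow$ RecursiveDivRem($A' \text{ div } \beta^k$, $B_1$)
7: $A'' \leftarrow R_0 \beta^k + (A' \bmod \beta^k) - Q_0 B_0$
8: while $A'' < 0$ do $Q_0 \leftarrow Q_0 - 1$, $A'' \leftarrow A'' + B$
9: return $Q := Q_1 \beta^k + Q_0$, $R := A''$.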


In Algorithm RecursiveDivRem, we may replace the condition $m < 2$ at
step 1 by $m < T$ for any integer $T \ge 2$. In practice, $T$ is usually in the range
50 to 200.

We cannot require $A < \beta^m B$ at input, since this condition may not be
satisfied in the recursive calls. Consider for example A = 5517, B = 56 with
$\beta = 10$: the first recursive call will divide 55 by 5, which yields a two-digit
quotient 11. Even $A \le \beta^m B$ is not recursively fulfilled, as this example shows.
The weakest possible input condition is that the $n$ most significant words of $A$
do not exceed those of $B$, i.e. $A < \beta^m (B + 1)$. In that case, the quotient is
bounded by $\beta^m + \lfloor (\beta^m - 1)/B \rfloor$, which yields $\beta^m + 1$ in the case $n = m$
(compare Exercise 1.19). See also Exercise 1.22.
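The recursion can be sketched in Python as below. The sketch rests on several assumptions: Python integers replace word arrays, `divmod` stands in for BasecaseDivRem, and the function name `recursive_divrem` and the `threshold` parameter (the integer $T$ above) are illustrative only.

```python
def recursive_divrem(A, B, n, m, beta, threshold=2):
    """Divide A (n+m words) by normalized B (n words), assuming n >= m."""
    if m < threshold:
        return divmod(A, B)                  # stands in for BasecaseDivRem
    k = m // 2
    bk = beta**k
    B1, B0 = divmod(B, bk)                   # high and low parts of B
    # High part of the quotient, from the high words of A and B.
    Q1, R1 = recursive_divrem(A // bk**2, B1, n - k, m - k, beta, threshold)
    Ap = R1 * bk**2 + (A % bk**2) - Q1 * B0 * bk    # A' = A - Q1*beta^k*B
    while Ap < 0:
        Q1, Ap = Q1 - 1, Ap + bk * B
    # Low part of the quotient.
    Q0, R0 = recursive_divrem(Ap // bk, B1, n - k, k, beta, threshold)
    App = R0 * bk + (Ap % bk) - Q0 * B0             # A'' = A' - Q0*B
    while App < 0:
        Q0, App = Q0 - 1, App + B
    return Q1 * bk + Q0, App
```

With A = 5517, B = 56 and `beta = 10`, the first recursive call divides 55 by 5 and returns the two-digit quotient 11, which the while-loop then corrects down to 9.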


Theorem 1.4 Algorithm RecursiveDivRem is correct, and uses $D(n+m, n)$
operations, where $D(n+m, n) = 2D(n, n - m/2) + 2M(m/2) + O(n)$. In
particular, $D(n) := D(2n, n)$ satisfies $D(n) = 2D(n/2) + 2M(n/2) + O(n)$,
which gives $D(n) \sim M(n)/(2^{\alpha-1} - 1)$ for $M(n) \sim n^\alpha$, $\alpha > 1$.


Proof. We first check the assumption for the recursive calls: $B_1$ is normalized
since it has the same most significant word as $B$.

After step 3, we have $A = (Q_1 B_1 + R_1)\beta^{2k} + (A \bmod \beta^{2k})$; thus, after
step 4, $A' = A - Q_1 \beta^k B$, which still holds after step 5. After step 6, we have
$A' = (Q_0 B_1 + R_0)\beta^k + (A' \bmod \beta^k)$, and, after step 7, $A'' = A' - Q_0 B$,
which still holds after step 8. At step 9, we have $A = QB + R$.

$A \text{ div } \beta^{2k}$ has $m + n - 2k$ words, and $B_1$ has $n - k$ words; thus, $0 \le Q_1 <
2\beta^{m-k}$ and $0 \le R_1 < B_1 < \beta^{n-k}$. At step 4, $-2\beta^{m+k} < A' < \beta^k B$. Since
$B$ is normalized, the while-loop at step 5 is performed at most four times (this
can happen only when $n = m$). At step 6, we have $0 \le A' < \beta^k B$; thus,
$A' \text{ div } \beta^k$ has at most $n$ words.

It follows $0 \le Q_0 < 2\beta^k$ and $0 \le R_0 < B_1 < \beta^{n-k}$. Hence, at step
7, $-2\beta^{2k} < A'' < B$, and, after at most four iterations at step 8, we have
$0 \le A'' < B$.



Theorem 1.4 gives $D(n) \sim 2M(n)$ for Karatsuba multiplication, and $D(n) \sim
2.63M(n)$ for Toom-Cook 3-way; in the FFT range, see Exercise 1.23.

The same idea as in Exercise 1.20 applies: to decrease the probability that
the estimated quotients $Q_1$ and $Q_0$ are too large, use one extra word of the
truncated dividend and divisors in the recursive calls to RecursiveDivRem.

A graphical view of Algorithm RecursiveDivRem in the case $m = n$ is
given in Figure 1.3, which represents the multiplication $Q \cdot B$: we first
compute the lower left corner in $D(n/2)$ (step 3), second the lower right corner in
$M(n/2)$ (step 4), third the upper left corner in $D(n/2)$ (step 6), and finally the
upper right corner in $M(n/2)$ (step 7).


Unbalanced division 
The condition $n \ge m$ in Algorithm RecursiveDivRem means that the
dividend $A$ is at most twice as large as the divisor $B$. When $A$ is more than twice
as large as $B$ ($m > n$ with the notation above), a possible strategy (see
Exercise 1.24) computes $n$ words of the quotient at a time. This reduces to the
base-case algorithm, replacing $\beta$ by $\beta^n$.

Figure 1.3 Divide and conquer division: a graphical view
(most significant parts at the lower left corner).


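Algorithm 1.9 UnbalancedDivision
Input: $A = \sum_{i=0}^{n+m-1} a_i \beta^i$, $B = \sum_{j=0}^{n-1} b_j \beta^j$, $B$ normalized, $m > n$
Output: quotient $Q$ and remainder $R$ of $A$ divided by $B$
$Q \leftarrow 0$
while $m > n$ do
   $(q, r) \leftarrow$ RecursiveDivRem($A \text{ div } \beta^{m-n}$, $B$)      ▷ 2n by n division
   $Q \leftarrow Q\beta^n + q$
   $A \leftarrow r\beta^{m-n} + A \bmod \beta^{m-n}$
   $m \leftarrow m - n$
$(q, r) \leftarrow$ RecursiveDivRem($A$, $B$)
return $Q := Q\beta^m + q$, $R := r$.
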
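The chunking performed by Algorithm UnbalancedDivision can be sketched in Python. In this illustration, which is an assumption-laden sketch rather than a word-level implementation, `divmod` stands in for each call to RecursiveDivRem and the name `unbalanced_divrem` is hypothetical.

```python
def unbalanced_divrem(A, B, n, m, beta):
    """Divide A (n+m words) by B (n words), producing n quotient words per pass."""
    Q = 0
    while m > n:
        shift = beta**(m - n)
        q, r = divmod(A // shift, B)       # stands in for a 2n-by-n RecursiveDivRem
        Q = Q * beta**n + q
        A = r * shift + A % shift          # replace the top 2n words by the remainder
        m -= n
    q, r = divmod(A, B)                    # final (n+m)-by-n division with m <= n
    return Q * beta**m + q, r
```

Each pass satisfies $A_{\text{old}} = q B \beta^{m-n} + A_{\text{new}}$, so the weights of the partial quotients add up to the full quotient.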


Figure 1.4 compares unbalanced multiplication and division in GNU MP.
As expected, multiplying $x$ words by $n - x$ words takes the same time as
multiplying $n - x$ words by $x$ words. However, there is no symmetry for the
division, since dividing $n$ words by $x$ words for $x < n/2$ is more expensive,
at least for the version of GMP that we used, than dividing $n$ words by $n - x$
words.

Figure 1.4 Time in $10^{-5}$ seconds for the multiplication (lower curve) of $x$
words by $1000 - x$ words and for the division (upper curve) of 1000 words
by $x$ words, with GMP 5.0.0 on a Core 2 running at 2.83GHz.


1.4.4 Newton’s method 


Newton's iteration gives the division algorithm with best asymptotic
complexity. One basic component of Newton's iteration is the computation of an
approximate inverse. We refer here to Chapter 4. The p-adic version of Newton's
method, also called Hensel lifting, is used in §1.4.5 for exact division.


1.4.5 Exact division 


A division is exact when the remainder is zero. This happens, for example, 
when normalizing a fraction a/b: we divide both a and b by their greatest com- 
mon divisor, and both divisions are exact. If the remainder is known 
a priori to be zero, this information is useful to speed up the computation 
of the quotient. 

Two strategies are possible: 


• use MSB (most significant bits first) division algorithms, without computing
  the lower part of the remainder. Here, we have to take care of rounding
  errors, in order to guarantee the correctness of the final result; or
• use LSB (least significant bits first) algorithms. If the quotient is known to
  be less than $\beta^n$, computing $a/b \bmod \beta^n$ will reveal it.


Subquadratic algorithms can use both strategies. We describe a least significant
bit algorithm using Hensel lifting, which can be viewed as a p-adic version of
Newton's method.

Algorithm ExactDivision uses the Karp-Markstein trick: lines 1-4 compute
$1/B \bmod \beta^{\lceil n/2 \rceil}$, while the two last lines incorporate the dividend to obtain
$A/B \bmod \beta^n$. Note that the middle product (§3.3.2) can be used in lines 4 and
6, to speed up the computation of $1 - BC$ and $A - BQ$, respectively.


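Algorithm 1.10 ExactDivision
Input: $A = \sum_{i=0}^{n-1} a_i \beta^i$, $B = \sum_{j=0}^{n-1} b_j \beta^j$
Output: quotient $Q = A/B \bmod \beta^n$
Require: $\gcd(b_0, \beta) = 1$
1: $C \leftarrow 1/b_0 \bmod \beta$
2: for $i$ from $\lceil \lg n \rceil - 1$ downto 1 do
3:    $k \leftarrow \lceil n/2^i \rceil$
4:    $C \leftarrow C + C(1 - BC) \bmod \beta^k$
5: $Q \leftarrow AC \bmod \beta^k$
6: $Q \leftarrow Q + C(A - BQ) \bmod \beta^n$.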

A further gain can be obtained by using both strategies simultaneously:
compute the most significant n/2 bits of the quotient using the MSB strategy, and
the least significant n/2 bits using the LSB strategy. Since a division of size n
is replaced by two divisions of size n/2, this gives a speedup of up to two for
quadratic algorithms (see Exercise 1.27).


1.4.6 Only quotient or remainder wanted 


When both the quotient and remainder of a division are needed, it is best 
to compute them simultaneously. This may seem to be a trivial statement; 
nevertheless, some high-level languages provide both div and mod, but no 
single instruction to compute both quotient and remainder. 

Once the quotient is known, the remainder can be recovered by a single
multiplication as $A - QB$; on the other hand, when the remainder is known,
the quotient can be recovered by an exact division as $(A - R)/B$ (§1.4.5).

However, it often happens that only one of the quotient or remainder is
needed. For example, the division of two floating-point numbers reduces to the
quotient of their significands (see Chapter 3). Conversely, the multiplication of
two numbers modulo $N$ reduces to the remainder of their product after
division by $N$ (see Chapter 2). In such cases, we may wonder if faster algorithms
exist.

For a dividend of $2n$ words and a divisor of $n$ words, a significant speedup
(up to a factor of two for quadratic algorithms) can be obtained when only
the quotient is needed, since we do not need to update the low $n$ words of the
current remainder (step 5 of Algorithm BasecaseDivRem).

It seems difficult to get a similar speedup when only the remainder is
required. One possibility is to use Svoboda's algorithm, but this requires some
precomputation, so is only useful when several divisions are performed with
the same divisor. The idea is the following: precompute a multiple $B_1$ of $B$,
having $3n/2$ words, the $n/2$ most significant words being $\beta^{n/2}$. Then
reducing $A \bmod B_1$ requires a single $n/2 \times n$ multiplication. Once $A$ is
reduced to $A_1$ of $3n/2$ words by Svoboda's algorithm with cost $2M(n/2)$, use
RecursiveDivRem on $A_1$ and $B$, which costs $D(n/2) + M(n/2)$. The
total cost is thus $3M(n/2) + D(n/2)$, instead of $2M(n/2) + 2D(n/2)$ for a
full division with RecursiveDivRem. This gives $5M(n)/3$ for Karatsuba and
$2.04M(n)$ for Toom-Cook 3-way, instead of $2M(n)$ and $2.63M(n)$,
respectively. A similar algorithm is described in §2.4.2 (Subquadratic Montgomery
Reduction) with further optimizations.


1.4.7 Division by a single word 


We assume here that we want to divide a multiple precision number by a
one-word integer $c$. As for multiplication by a one-word integer, this is an
important special case. It arises for example in Toom-Cook multiplication,
where we have to perform an exact division by 3 (§1.3.3). We could of course
use a classical division algorithm (§1.4.1). When $\gcd(c, \beta) = 1$, Algorithm
DivideByWord might be used to compute a modular division

    $A + b\beta^n = cQ$,

where the "carry" $b$ will be zero when the division is exact.


Theorem 1.5 The output of Alg. DivideByWord satisfies $A + b\beta^n = cQ$.


Proof. We show that after step $i$, $0 \le i < n$, we have $A_i + b\beta^{i+1} = cQ_i$, where
$A_i := \sum_{j=0}^{i} a_j \beta^j$ and $Q_i := \sum_{j=0}^{i} q_j \beta^j$. For $i = 0$, this is $a_0 + b\beta = cq_0$,
which is just line 7; since $q_0 = a_0/c \bmod \beta$, $q_0 c - a_0$ is divisible by $\beta$. Assume
now that $A_{i-1} + b\beta^i = cQ_{i-1}$ holds for $1 \le i < n$. We have $a_i - b + b'\beta = x$,
so $x + b''\beta = cq_i$, thus $A_i + (b' + b'')\beta^{i+1} = A_{i-1} + \beta^i(a_i + b'\beta + b''\beta) =
cQ_{i-1} - b\beta^i + \beta^i(x + b - b'\beta + b'\beta + b''\beta) = cQ_{i-1} + \beta^i(x + b''\beta) = cQ_i$.


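Algorithm 1.11 DivideByWord
Input: $A = \sum_{i=0}^{n-1} a_i \beta^i$, $0 \le c < \beta$, $\gcd(c, \beta) = 1$
Output: $Q = \sum_{i=0}^{n-1} q_i \beta^i$ and $0 \le b < c$ such that $A + b\beta^n = cQ$
1: $d \leftarrow 1/c \bmod \beta$                      ▷ might be precomputed
2: $b \leftarrow 0$
3: for $i$ from 0 to $n - 1$ do
4:    if $b \le a_i$ then $(x, b') \leftarrow (a_i - b, 0)$
5:    else $(x, b') \leftarrow (a_i - b + \beta, 1)$
6:    $q_i \leftarrow dx \bmod \beta$
7:    $b'' \leftarrow (q_i c - x)/\beta$
8:    $b \leftarrow b' + b''$
9: return $\sum_{i=0}^{n-1} q_i \beta^i$, $b$.


REMARK: at step 7, since $0 \le x < \beta$, $b''$ can also be obtained as $\lfloor q_i c/\beta \rfloor$.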
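A direct Python transcription of Algorithm DivideByWord may help to follow the carries. The little-endian word list and the name `divide_by_word` are assumptions for illustration, and `pow(c, -1, beta)` needs Python 3.8 or later.

```python
def divide_by_word(a, c, beta):
    """LSB-first division of the little-endian words a by c, with gcd(c, beta) = 1.

    Returns (q, b) such that A + b*beta^n == c*Q, where Q has the words q.
    """
    d = pow(c, -1, beta)                 # step 1: 1/c mod beta
    b = 0                                # step 2: incoming borrow/carry
    q = []
    for ai in a:                         # step 3
        if b <= ai:                      # step 4
            x, b1 = ai - b, 0
        else:                            # step 5: borrow one word
            x, b1 = ai - b + beta, 1
        qi = (d * x) % beta              # step 6
        b2 = (qi * c - x) // beta        # step 7: exact, since qi*c = x mod beta
        b = b1 + b2                      # step 8
        q.append(qi)
    return q, b
```

For example, with `beta = 1000` and `c = 3`, the words `[367, 370, 370]` of 370 370 367 give the quotient words `[789, 456, 123]` and a zero carry, since 370 370 367 = 3 · 123 456 789.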


Algorithm DivideByWord is just a special case of Hensel’s division, which 
is the topic of the next section; it can easily be extended to divide by integers 
of a few words. 


1.4.8 Hensel’s division 


Classical division involves cancelling the most significant part of the dividend
by a multiple of the divisor, while Hensel's division cancels the least significant
part (Figure 1.5). Given a dividend $A$ of $2n$ words and a divisor $B$ of $n$ words,
the classical or MSB (most significant bit) division computes a quotient $Q$ and
a remainder $R$ such that $A = QB + R$, while Hensel's or LSB (least significant
bit) division computes a LSB-quotient $Q'$ and a LSB-remainder $R'$ such that
$A = Q'B + R'\beta^n$. While MSB division requires the most significant bit of $B$
to be set, LSB division requires $B$ to be relatively prime to the word base $\beta$,
i.e. $B$ to be odd for $\beta$ a power of two.

Figure 1.5 Classical/MSB division (left) vs Hensel/LSB division (right).



The LSB-quotient is uniquely defined by $Q' = A/B \bmod \beta^n$, with
$0 \le Q' < \beta^n$. This in turn uniquely defines the LSB-remainder $R' =
(A - Q'B)\beta^{-n}$, with $-B < R' < \beta^n$.

Most MSB-division variants (naive, with preconditioning, divide and conquer,
Newton's iteration) have their LSB-counterpart. For example, LSB
preconditioning involves using a multiple $kB$ of the divisor such that $kB =
1 \bmod \beta$, and Newton's iteration is called Hensel lifting in the LSB case. The
exact division algorithm described at the end of §1.4.5 uses both MSB- and
LSB-division simultaneously. One important difference is that LSB-division
does not need any correction step, since the carries go in the direction opposite
to the cancelled bits.

When only the remainder is wanted, Hensel’s division is usually known as 
Montgomery reduction (see §2.4.2). 
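The definitions of the LSB-quotient and LSB-remainder can be checked with a short Python sketch. It computes them directly from their defining formulas using a modular inverse, which is an assumption made for brevity and not one of the word-by-word LSB algorithms; the name `hensel_divrem` is hypothetical and `pow(B, -1, ...)` needs Python 3.8 or later.

```python
def hensel_divrem(A, B, n, beta):
    """Return (Q', R') with A == Q'*B + R'*beta^n and 0 <= Q' < beta^n.

    Requires gcd(B, beta) = 1.
    """
    bn = beta**n
    Qp = (A * pow(B, -1, bn)) % bn     # Q' = A/B mod beta^n
    Rp = (A - Qp * B) // bn            # exact: A - Q'B is a multiple of beta^n
    return Qp, Rp
```

For a dividend of $2n$ words and a divisor of $n$ words, the returned $R'$ lies in the range $-B < R' < \beta^n$ stated above.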


1.5 Roots 


1.5.1 Square root 


The “paper and pencil” method once taught at school to extract square roots is 
very similar to “paper and pencil” division. It decomposes an integer m of the 
form s? + r, taking two digits of m at a time, and finding one digit of s for 
each two digits of m. It is based on the following idea. If m = s? +r is the 
current decomposition, then taking two more digits of the argument, we have a 
decomposition of the form 100m+r’ = 100s? +100r+r’ with 0 < r’ < 100. 
Since (10s + t)? = 100s? + 20st + t?, a good approximation to the next digit 
t can be found by dividing 10r by 2s. 

Algorithm SqrtRem generalizes this idea to a power $\beta^\ell$ of the internal base
close to $m^{1/4}$: we obtain a divide and conquer algorithm, which is in fact an
error-free variant of Newton's method (cf. Chapter 4):




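Algorithm 1.12 SqrtRem
Input: $m = a_{n-1}\beta^{n-1} + \cdots + a_1\beta + a_0$ with $a_{n-1} \ne 0$
Output: $(s, r)$ such that $s^2 \le m = s^2 + r < (s + 1)^2$
Require: a base-case routine BasecaseSqrtRem
$\ell \leftarrow \lfloor (n - 1)/4 \rfloor$
if $\ell = 0$ then return BasecaseSqrtRem($m$)
write $m = a_3\beta^{3\ell} + a_2\beta^{2\ell} + a_1\beta^\ell + a_0$ with $0 \le a_2, a_1, a_0 < \beta^\ell$
$(s', r') \leftarrow$ SqrtRem($a_3\beta^\ell + a_2$)
$(q, u) \leftarrow$ DivRem($r'\beta^\ell + a_1$, $2s'$)
$s \leftarrow s'\beta^\ell + q$
$r \leftarrow u\beta^\ell + a_0 - q^2$
if $r < 0$ then
   $r \leftarrow r + 2s - 1$, $s \leftarrow s - 1$
return $(s, r)$.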


Theorem 1.6 Algorithm SqrtRem correctly returns the integer square root
$s$ and remainder $r$ of the input $m$, and has complexity $R(2n) \sim R(n) +
D(n) + S(n)$, where $D(n)$ and $S(n)$ are the complexities of the division
with remainder and squaring respectively. This gives $R(n) \sim n^2/2$ with naive
multiplication, $R(n) \sim 4K(n)/3$ with Karatsuba's multiplication, assuming
$S(n) \sim 2M(n)/3$.


As an example, assume Algorithm SqrtRem is called on m = 123 456 789
with $\beta = 10$. We have $n = 9$, $\ell = 2$, $a_3 = 123$, $a_2 = 45$, $a_1 = 67$, and
$a_0 = 89$. The recursive call for $a_3\beta^\ell + a_2 = 12\,345$ yields $s' = 111$ and
$r' = 24$. The DivRem call yields $q = 11$ and $u = 25$, which gives $s = 11\,111$
and $r = 2\,468$.
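The recursion of Algorithm SqrtRem can be tried out with the Python sketch below. It differs from the example above in that it works in base $\beta = 2$ (so $\ell$ counts bits), and it uses `math.isqrt` (Python 3.8+) as a stand-in for BasecaseSqrtRem; the function name `sqrt_rem` and the base-case cutoff `m < 16` are assumptions.

```python
from math import isqrt


def sqrt_rem(m):
    """Return (s, r) with s*s <= m = s*s + r < (s+1)**2, splitting m in base 2."""
    if m < 16:                           # base case, stands in for BasecaseSqrtRem
        s = isqrt(m)
        return s, m - s * s
    l = (m.bit_length() - 1) // 4        # ell = floor((n-1)/4), n in bits
    mask = (1 << l) - 1
    a0 = m & mask
    a1 = (m >> l) & mask
    hi = m >> (2 * l)                    # a3*beta^ell + a2
    s1, r1 = sqrt_rem(hi)                # root of the upper half
    q, u = divmod((r1 << l) + a1, 2 * s1)
    s = (s1 << l) + q
    r = (u << l) + a0 - q * q
    if r < 0:                            # root estimate one too large
        r += 2 * s - 1
        s -= 1
    return s, r
```

On m = 123 456 789 it returns the same result (11 111, 2 468) as the base-10 example, although the intermediate values differ because of the different base.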


Another nice way to compute the integer square root of an integer $m$, i.e.
$\lfloor m^{1/2} \rfloor$, is Algorithm SqrtInt, which is an all-integer version of Newton's
method (§4.2).

Still with input 123 456 789, we successively get s = 61 728 395, 30 864 198,
15 432 100, 7 716 053, 3 858 034, 1 929 032, 964 547, 482 337, 241 296,
120 903, 60 962, 31 493, 17 706, 12 339, 11 172, 11 111, 11 111. Convergence
is slow because the initial value of $u$ assigned at line 1 is much too large.
However, any initial value greater than or equal to $\lfloor m^{1/2} \rfloor$ works (see the proof of
Algorithm RootInt below): starting from s = 12 000, we get s = 11 144, then
s = 11 111. See Exercise 1.28.

Algorithm 1.13 SqrtInt
Input: an integer $m \ge 1$
Output: $s = \lfloor m^{1/2} \rfloor$
1: $u \leftarrow m$                      ▷ any value $u \ge \lfloor m^{1/2} \rfloor$ works
2: repeat
3:    $s \leftarrow u$
4:    $t \leftarrow s + \lfloor m/s \rfloor$
5:    $u \leftarrow \lfloor t/2 \rfloor$
6: until $u \ge s$
7: return $s$.


1.5.2 kth root 


The idea of Algorithm SqrtRem for the integer square root can be generalized
to any power: if the current decomposition is $m = m'\beta^k + m''\beta^{k-1} + m'''$,
first compute a $k$th root of $m'$, say $m' = s^k + r$, then divide $r\beta + m''$ by
$ks^{k-1}$ to get an approximation of the next root digit $t$, and correct it if needed.
Unfortunately, the computation of the remainder, which is easy for the square
root, involves $O(k)$ terms for the $k$th root, and this method may be slower than
Newton's method with floating-point arithmetic (§4.2.3).


Similarly, Algorithm SqrtInt can be generalized to the kth root (see Algo- 
rithm RootInt). 


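Algorithm 1.14 RootInt
Input: integers $m \ge 1$, and $k \ge 2$
Output: $s = \lfloor m^{1/k} \rfloor$
1: $u \leftarrow m$                      ▷ any value $u \ge \lfloor m^{1/k} \rfloor$ works
2: repeat
3:    $s \leftarrow u$
4:    $t \leftarrow (k - 1)s + \lfloor m/s^{k-1} \rfloor$
5:    $u \leftarrow \lfloor t/k \rfloor$
6: until $u \ge s$
7: return $s$.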


Theorem 1.7 Algorithm RootInt terminates and returns $\lfloor m^{1/k} \rfloor$.


Proof. As long as $u < s$ in step 6, the sequence of $s$-values is decreasing;
thus, it suffices to consider what happens when $u \ge s$. First it is easy to see that
$u \ge s$ implies $m \ge s^k$, because $t \ge ks$ and therefore $(k-1)s + m/s^{k-1} \ge ks$.
Consider now the function $f(t) := [(k-1)t + m/t^{k-1}]/k$ for $t > 0$; its
derivative is negative for $t < m^{1/k}$ and positive for $t > m^{1/k}$; thus,
$f(t) \ge f(m^{1/k}) = m^{1/k}$. This proves that $s \ge \lfloor m^{1/k} \rfloor$. Together with
$s \le m^{1/k}$, this proves that $s = \lfloor m^{1/k} \rfloor$ at the end of the algorithm.


Note that any initial value greater than or equal to $\lfloor m^{1/k} \rfloor$ works at step 1.
Incidentally, we have proved the correctness of Algorithm SqrtInt, which is
just the special case $k = 2$ of Algorithm RootInt.
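Algorithm RootInt translates almost line by line into Python. In this sketch the function name `root_int` is an assumption; calling it with `k = 2` gives the behaviour of Algorithm SqrtInt.

```python
def root_int(m, k):
    """Integer k-th root floor(m^(1/k)) by the all-integer Newton iteration."""
    u = m                                  # any u >= floor(m^(1/k)) works
    while True:
        s = u
        t = (k - 1) * s + m // s**(k - 1)
        u = t // k
        if u >= s:                         # the s-values stopped decreasing
            return s
```

Starting from `u = m` is simple but slow for large inputs; as noted above, a tighter upper bound on the root can be used as the initial value instead.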


1.5.3 Exact root 


When a kth root is known to be exact, there is of course no need to compute 
exactly the final remainder in “exact root” algorithms, which saves some com- 
putation time. However, we have to check that the remainder is sufficiently 
small that the computed root is correct. 

When a root is known to be exact, we may also try to compute it starting
from the least significant bits, as for exact division. Indeed, if $s^k = m$, then
$s^k = m \bmod \beta^\ell$ for any integer $\ell$. However, in the case of exact division, the
equation $a = qb \bmod \beta^\ell$ has only one solution $q$ as soon as $b$ is relatively
prime to $\beta$. Here, the equation $s^k = m \bmod \beta^\ell$ may have several solutions,
so the lifting process is not unique. For example, $x^2 = 1 \bmod 2^3$ has four
solutions 1, 3, 5, 7.

Suppose we have $s^k = m \bmod \beta^\ell$, and we want to lift to $\beta^{\ell+1}$. This implies
$(s + t\beta^\ell)^k = m + m'\beta^\ell \bmod \beta^{\ell+1}$, where $0 \le t, m' < \beta$. Thus

$$kt = m' + \frac{m - s^k}{\beta^\ell} \bmod \beta.$$

This equation has a unique solution $t$ when $k$ is relatively prime to $\beta$. For
example, we can extract cube roots in this way for $\beta$ a power of two. When $k$
is relatively prime to $\beta$, we can also compute the root simultaneously from the
most significant and least significant ends, as for exact division.
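The least-significant-first lifting can be sketched for the simplest case, $\beta = 2$ and $k = 3$ with $m$ odd. The root $s$ is then odd, so the next bit $t$ is simply the parity of $(m - s^3)/2^\ell$. The function name `cube_root_lsb` and the parameter `nbits` (the number of low bits wanted) are hypothetical, chosen for this example only.

```python
def cube_root_lsb(m: int, nbits: int) -> int:
    """Low nbits bits of the cube root of an odd perfect cube m, lifted bit by bit."""
    if m % 2 == 0:
        raise ValueError("this sketch assumes m is odd")
    s = 1  # s**3 = m mod 2 holds for any odd m
    for l in range(1, nbits):
        # Invariant: s**3 = m mod 2**l, so m - s**3 is divisible by 2**l.
        t = ((m - s**3) >> l) & 1  # next bit of the root
        s += t << l
    return s


print(cube_root_lsb(12345**3, 14))  # 12345, since 12345 < 2**14
```

Each step fixes one more bit of the root, so the full root is known once `nbits` reaches its bit length.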


Unknown exponent 


Assume now that we want to check if a given integer m is an exact power, 
without knowing the corresponding exponent. For example, some primality 
testing or factorization algorithms fail when given an exact power, so this has 
to be checked first. Algorithm IsPower detects exact powers, and returns the
largest corresponding exponent (or 1 if the input is not an exact power).

To quickly detect non-$k$th powers at step 2, we may use modular algorithms
when $k$ is relatively prime to the base $\beta$ (see above).
Algorithm 1.15 IsPower
Input: a positive integer $m$
Output: $k \ge 2$ when $m$ is an exact $k$th power, 1 otherwise
1: for $k$ from $\lfloor \lg m \rfloor$ downto 2 do
2:     if $m$ is a $k$th power then return $k$
3: return 1.


REMARK: in Algorithm IsPower, we can limit the search to prime exponents
$k$, but then the algorithm does not necessarily return the largest exponent, and
we might have to call it again. For example, taking $m = 117649$, the modified
algorithm first returns 3 because $117649 = 49^3$, and when called again with
$m = 49$, it returns 2.
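A direct, unoptimized rendering of Algorithm IsPower is sketched below. The test "m is a kth power" at step 2 is implemented here by computing an integer kth root and raising it back to the power k; this is one possible choice, not the modular test suggested above. The helper name `iroot` is an assumption of the example.

```python
def iroot(m: int, k: int) -> int:
    """floor(m**(1/k)) by the integer Newton iteration of Algorithm RootInt."""
    u = 1 << -(-m.bit_length() // k)  # 2**ceil(bits/k) >= floor(m**(1/k))
    while True:
        s = u
        u = ((k - 1) * s + m // s ** (k - 1)) // k
        if u >= s:
            return s


def is_power(m: int) -> int:
    """Largest k >= 2 such that m is an exact kth power, or 1 if there is none."""
    for k in range(m.bit_length() - 1, 1, -1):  # k from floor(lg m) downto 2
        if iroot(m, k) ** k == m:
            return k
    return 1


print(is_power(117649))  # 6, since 117649 = 7**6
print(is_power(49))      # 2
print(is_power(12))      # 1
```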


1.6 Greatest common divisor 


Many algorithms for computing gcds may be found in the literature. We can 
distinguish between the following (non-exclusive) types: 


• Left-to-right (MSB) versus right-to-left (LSB) algorithms: in the former the
actions depend on the most significant bits, while in the latter the actions
depend on the least significant bits.

• Naive algorithms: these $O(n^2)$ algorithms consider one word of each operand
at a time, trying to guess from them the first quotients; we count in this class
algorithms considering double-size words, namely Lehmer's algorithm and
Sorenson's $k$-ary reduction in the left-to-right and right-to-left cases respectively;
algorithms not in this class consider a number of words that depends
on the input size $n$, and are often subquadratic.

• Subtraction-only algorithms: these algorithms trade divisions for subtractions,
at the cost of more iterations.

• Plain versus extended algorithms: the former just compute the gcd of the
inputs, while the latter express the gcd as a linear combination of the inputs.


1.6.1 Naive GCD 


For completeness, we mention Euclid's algorithm for finding the gcd of two
non-negative integers $u$, $v$.

Euclid's algorithm is discussed in many textbooks, and we do not recommend
it in its simplest form, except for testing purposes. Indeed, it is usually a
slow way to compute a gcd. However, Euclid's algorithm does show the connection
between gcds and continued fractions. If $u/v$ has a regular continued
fraction of the form

$$u/v = q_0 + \cfrac{1}{q_1 + \cfrac{1}{q_2 + \cfrac{1}{q_3 + \cdots}}},$$

then the quotients $q_0, q_1, \ldots$ are precisely the quotients $u$ div $v$ of the divisions
performed in Euclid's algorithm. For more on continued fractions, see §4.6.


Algorithm 1.16 EuclidGcd
Input: $u$, $v$ nonnegative integers (not both zero)
Output: $\gcd(u, v)$
while $v \ne 0$ do
    $(u, v) \leftarrow (v, u \bmod v)$
return $u$.
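The link between Euclid's algorithm and continued fractions can be seen by recording the quotients as the loop runs. The following is a minimal sketch; the function names `euclid_gcd` and `euclid_quotients` are chosen for this example.

```python
def euclid_gcd(u: int, v: int) -> int:
    """Algorithm EuclidGcd: gcd of non-negative integers, not both zero."""
    while v != 0:
        u, v = v, u % v
    return u


def euclid_quotients(u: int, v: int) -> list[int]:
    """Quotients u div v met by Euclid's algorithm, i.e. the partial
    quotients q0, q1, ... of the regular continued fraction of u/v."""
    qs = []
    while v != 0:
        q, r = divmod(u, v)
        qs.append(q)
        u, v = v, r
    return qs


print(euclid_gcd(935, 714))        # 17
print(euclid_quotients(427, 321))  # [1, 3, 35, 3]
```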


Double-Digit Gcd. A first improvement comes from Lehmer's observation:
the first few quotients in Euclid's algorithm usually can be determined from
the most significant words of the inputs. This avoids expensive divisions that
give small quotients most of the time (see [142, §4.5.3]). Consider for example
$a = 427\,419\,669\,081$ and $b = 321\,110\,693\,270$ with 3-digit words. The
first quotients are 1, 3, 48, ... Now, if we consider the most significant words,
namely 427 and 321, we get the quotients 1, 3, 35, ... If we stop after the
first two quotients, we see that we can replace the initial inputs by $a - b$ and
$-3a + 4b$, which gives 106 308 975 811 and 2 183 765 837.

Lehmer's algorithm determines cofactors from the most significant words
of the input integers. Those cofactors usually have size only half a word. The
DoubleDigitGcd algorithm (which should be called "double-word") uses
the two most significant words instead, which gives cofactors $t, u, v, w$ of one
full word each, such that $\gcd(a, b) = \gcd(ta + ub, va + wb)$. This is optimal for
the computation of the four products $ta$, $ub$, $va$, $wb$. With the above example,
if we consider 427 419 and 321 110, we find that the first five quotients agree,
so we can replace $a, b$ by $-148a + 197b$ and $441a - 587b$, i.e. 695 550 202 and
97 115 231.
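The numbers in this example can be reproduced with a few lines of Python. The sketch below is only an illustration: it finds how many leading quotients agree by comparing against the exact quotients, which a real Lehmer-style routine cannot do (it must decide from the leading words alone). The helper names `quotients`, `cofactors` and `agree` are assumptions of the example.

```python
def quotients(u: int, v: int, count: int) -> list[int]:
    """First `count` quotients of Euclid's algorithm on (u, v)."""
    qs = []
    while v != 0 and len(qs) < count:
        qs.append(u // v)
        u, v = v, u % v
    return qs


def cofactors(qs: list[int]) -> tuple[int, int, int, int]:
    """(t, u, v, w) such that applying the quotients qs to (a, b)
    yields the pair (t*a + u*b, v*a + w*b)."""
    t, u, v, w = 1, 0, 0, 1
    for q in qs:
        t, u, v, w = v, w, t - q * v, u - q * w
    return t, u, v, w


def agree(xs: list[int], ys: list[int]) -> int:
    """Length of the common prefix of two quotient lists."""
    n = 0
    while n < min(len(xs), len(ys)) and xs[n] == ys[n]:
        n += 1
    return n


a, b = 427419669081, 321110693270
exact = quotients(a, b, 6)
for top_a, top_b in ((427, 321), (427419, 321110)):
    n = agree(exact, quotients(top_a, top_b, 6))
    t, u, v, w = cofactors(exact[:n])
    print(n, (t, u, v, w), (t * a + u * b, v * a + w * b))
# 2 (1, -1, -3, 4) (106308975811, 2183765837)
# 5 (-148, 197, 441, -587) (695550202, 97115231)
```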

The subroutine HalfBezout takes as input two 2-word integers, performs 
Euclid’s algorithm until the smallest remainder fits in one word, and returns 
the corresponding matrix $[t, u; v, w]$.


Algorithm 1.17 DoubleDigitGcd
Input: $a := a_{n-1}\beta^{n-1} + \cdots + a_0$, $b := b_{m-1}\beta^{m-1} + \cdots + b_0$
Output: $\gcd(a, b)$
if $b = 0$ then return $a$
if $m < 2$ then return BasecaseGcd$(a, b)$
if $a < b$ or $n > m$ then return DoubleDigitGcd$(b, a \bmod b)$
$(t, u, v, w) \leftarrow$ HalfBezout$(a_{n-1}\beta + a_{n-2}, b_{n-1}\beta + b_{n-2})$
return DoubleDigitGcd$(|ta + ub|, |va + wb|)$.


Binary Gcd. A better algorithm than Euclid's, though also of $O(n^2)$ complexity,
is the binary algorithm. It differs from Euclid's algorithm in two ways:
it considers least significant bits first, and it avoids divisions, except for divisions
by two (which can be implemented as shifts on a binary computer). See
Algorithm BinaryGcd. Note that the first three "while" loops can be omitted
if the inputs $a$ and $b$ are odd.


Algorithm 1.18 BinaryGcd
Input: $a, b > 0$
Output: $\gcd(a, b)$
$t \leftarrow 1$
while $a \bmod 2 = b \bmod 2 = 0$ do
    $(t, a, b) \leftarrow (2t, a/2, b/2)$
while $a \bmod 2 = 0$ do
    $a \leftarrow a/2$
while $b \bmod 2 = 0$ do
    $b \leftarrow b/2$    ▷ now $a$ and $b$ are both odd
while $a \ne b$ do
    $(a, b) \leftarrow (|a - b|, \min(a, b))$
    $a \leftarrow a/2^{\nu(a)}$    ▷ $\nu(a)$ is the 2-valuation of $a$
return $ta$.
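The binary algorithm can be transcribed into Python as follows. This is a sketch for illustration, with the function name `binary_gcd` chosen here; the divisions by two are written with `//` for readability, although shifts would do the same job.

```python
def binary_gcd(a: int, b: int) -> int:
    """Algorithm BinaryGcd for positive integers a and b."""
    if a <= 0 or b <= 0:
        raise ValueError("need a, b > 0")
    t = 1
    while a % 2 == 0 and b % 2 == 0:  # common factors of two
        t, a, b = 2 * t, a // 2, b // 2
    while a % 2 == 0:
        a //= 2
    while b % 2 == 0:
        b //= 2                        # now a and b are both odd
    while a != b:
        a, b = abs(a - b), min(a, b)   # |a - b| is even and nonzero
        while a % 2 == 0:              # a <- a / 2**v(a)
            a //= 2
    return t * a


print(binary_gcd(935, 714))  # 17
print(binary_gcd(48, 18))    # 6
```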


Sorenson's $k$-ary reduction
The binary algorithm is based on the fact that if $a$ and $b$ are both odd, then $a - b$
is even, and we can remove a factor of two since $\gcd(a, b)$ is odd. Sorenson's
$k$-ary reduction is a generalization of that idea: given $a$ and $b$ odd, we try to
find small integers $u$, $v$ such that $ua - vb$ is divisible by a large power of two.


Theorem 1.8 [226] If $a, b > 0$, $m > 1$ with $\gcd(a, m) = \gcd(b, m) = 1$,
there exist $u, v$, $0 < |u|, v < \sqrt{m}$ such that $ua = vb \bmod m$.

Algorithm ReducedRatMod finds such a pair $(u, v)$. It is a simple variation of
the extended Euclidean algorithm; indeed, the $u_i$ are quotients in the continued
fraction expansion of $c/m$.


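Algorithm 1.19 ReducedRatMod
Input: $a, b > 0$, $m > 1$ with $\gcd(a, m) = \gcd(b, m) = 1$
Output: $(u, v)$ such that $0 < |u|, v < \sqrt{m}$ and $ua = vb \bmod m$
1: $c \leftarrow a/b \bmod m$
2: $(u_1, v_1) \leftarrow (0, m)$
3: $(u_2, v_2) \leftarrow (1, c)$
4: while $v_2 \ge \sqrt{m}$ do
5:     $q \leftarrow \lfloor v_1/v_2 \rfloor$
6:     $(u_1, u_2) \leftarrow (u_2, u_1 - qu_2)$
7:     $(v_1, v_2) \leftarrow (v_2, v_1 - qv_2)$
8: return $(u_2, v_2)$.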
When $m$ is a prime power, the inversion $1/b \bmod m$ at step 1 of Algorithm
ReducedRatMod can be performed efficiently using Hensel lifting (§2.5).

Given two integers $a, b$ of say $n$ words, Algorithm ReducedRatMod with
$m = \beta^2$ returns two integers $u, v$ such that $vb - ua$ is a multiple of $\beta^2$. Since
$u, v$ have at most one word each, $a' = (vb - ua)/\beta^2$ has at most $n - 1$ words
(plus possibly one bit); therefore with $b' = b \bmod a'$ we obtain $\gcd(a, b) =
\gcd(a', b')$, where both $a'$ and $b'$ have about one word less than $\max(a, b)$. This
gives an LSB variant of the double-digit (MSB) algorithm.
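A possible Python rendering of Algorithm ReducedRatMod is sketched below. It assumes Python 3.8 or later, where `pow(b, -1, m)` computes a modular inverse; the comparison $v_2 \ge \sqrt{m}$ is done on squares to stay in integer arithmetic. The function name and the sample inputs are chosen for illustration only.

```python
def reduced_rat_mod(a: int, b: int, m: int) -> tuple[int, int]:
    """Return (u, v) with u*a = v*b mod m and 0 < v < sqrt(m) (Algorithm ReducedRatMod).

    Assumes a, b > 0, m > 1 and gcd(a, m) = gcd(b, m) = 1.
    """
    c = a * pow(b, -1, m) % m          # step 1: c = a/b mod m
    u1, v1 = 0, m
    u2, v2 = 1, c
    while v2 * v2 >= m:                # v2 >= sqrt(m)
        q = v1 // v2
        u1, u2 = u2, u1 - q * u2
        v1, v2 = v2, v1 - q * v2
    return u2, v2


a, b, m = 935, 713, 2**16              # a and b odd, hence coprime to m
u, v = reduced_rat_mod(a, b, m)
print((u * a - v * b) % m == 0)        # True
print(0 < v and v * v < m)             # True
```

With $m = \beta^2$ this gives the one-word multipliers used in the LSB reduction described above.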


1.6.2 Extended GCD 


Algorithm ExtendedGcd solves the extended greatest common divisor problem:
given two integers $a$ and $b$, it computes their gcd $g$, and also two integers
$u$ and $v$ (called Bézout coefficients or sometimes cofactors or multipliers) such
that $g = ua + vb$.

If $a_0$ and $b_0$ are the input numbers, and $a, b$ the current values, the following
invariants hold at the start of each iteration of the while loop and after the while
loop: $a = ua_0 + vb_0$, and $b = wa_0 + xb_0$. (See Exercise 1.30 for a bound on
the cofactor $u$.)

An important special case is modular inversion (see Chapter 2): given an
integer $n$, we want to compute $1/a \bmod n$ for $a$ relatively prime to $n$. We then
simply run Algorithm ExtendedGcd with input $a$ and $b = n$; this yields $u$ and
$v$ with $ua + vn = 1$, and thus $1/a = u \bmod n$. Since $v$ is not needed here, we
can simply avoid computing $v$ and $x$, by removing steps 2 and 7.
Algorithm 1.20 ExtendedGcd
Input: positive integers $a$ and $b$
Output: integers $(g, u, v)$ such that $g = \gcd(a, b) = ua + vb$
1: $(u, w) \leftarrow (1, 0)$
2: $(v, x) \leftarrow (0, 1)$
3: while $b \ne 0$ do
4:     $(q, r) \leftarrow$ DivRem$(a, b)$
5:     $(a, b) \leftarrow (b, r)$
6:     $(u, w) \leftarrow (w, u - qw)$
7:     $(v, x) \leftarrow (x, v - qx)$
8: return $(a, u, v)$.


It may also be worthwhile to compute only $u$ in the general case, as the
cofactor $v$ can be recovered from $v = (g - ua)/b$, this division being exact
(see §1.4.5).
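The extended algorithm and its use for modular inversion can be sketched in Python as follows. The function names are chosen for this example, and `mod_inverse` keeps the computation of $v$ for simplicity even though, as noted above, it is not needed for inversion.

```python
def extended_gcd(a: int, b: int) -> tuple[int, int, int]:
    """Algorithm ExtendedGcd: return (g, u, v) with g = gcd(a, b) = u*a + v*b."""
    u, w = 1, 0                      # step 1
    v, x = 0, 1                      # step 2
    while b != 0:
        q, r = divmod(a, b)          # step 4
        a, b = b, r                  # step 5
        u, w = w, u - q * w          # step 6
        v, x = x, v - q * x          # step 7
    return a, u, v


def mod_inverse(a: int, n: int) -> int:
    """Return 1/a mod n, assuming gcd(a, n) = 1."""
    g, u, _ = extended_gcd(a % n, n)
    if g != 1:
        raise ValueError("a is not invertible modulo n")
    return u % n


g, u, v = extended_gcd(935, 714)
print(g, u * 935 + v * 714)   # 17 17
print(mod_inverse(17, 3120))  # 2753
```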

All known algorithms for subquadratic gcd rely on an extended gcd 
subroutine, which is called recursively, so we discuss the subquadratic 
extended gcd in the next section. 


1.6.3 Half binary GCD, divide and conquer GCD 


Designing a subquadratic integer gcd algorithm that is both mathematically 
correct and efficient in practice is a challenging problem. 

A first remark is that, starting from $n$-bit inputs, there are $O(n)$ terms in the
remainder sequence $r_0 = a, r_1 = b, \ldots, r_{i+1} = r_{i-1} \bmod r_i, \ldots$, and the size
of $r_i$ decreases linearly with $i$. Thus, computing all the partial remainders $r_i$
leads to a quadratic cost, and a fast algorithm should avoid this.

However, the partial quotients $q_i = r_{i-1}$ div $r_i$ are usually small; the main
idea is thus to compute them without computing the partial remainders. This
can be seen as a generalization of the DoubleDigitGcd algorithm: instead of
considering a fixed base $\beta$, adjust it so that the inputs have four "big words".
The cofactor-matrix returned by the HalfBezout subroutine will then reduce
the input size to about $3n/4$. A second call with the remaining two most
significant "big words" of the new remainders will reduce their size to half
the input size. See Exercise 1.31.

The same method applies in the LSB case, and is in fact simpler to turn
into a correct algorithm. In this case, the terms $r_i$ form a binary remainder
sequence, which corresponds to the iteration of the BinaryDivide algorithm,
with starting values $a, b$. The integer $q$ is the binary quotient of $a$ and $b$, and $r$
is the binary remainder.


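Algorithm 1.21 BinaryDivide
Input: $a, b \in \mathbb{Z}$ with $\nu(b) - \nu(a) = j > 0$
Output: $|q| < 2^j$ and $r = a + q2^{-j}b$ such that $\nu(b) < \nu(r)$
$b' \leftarrow 2^{-j}b$
$q \leftarrow -a/b' \bmod 2^{j+1}$
if $q \ge 2^j$ then $q \leftarrow q - 2^{j+1}$
return $q$, $r = a + q2^{-j}b$.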


This right-to-left division defines a right-to-left remainder sequence $a_0 = a$,
$a_1 = b, \ldots$, where $a_{i+1} =$ BinaryRemainder$(a_{i-1}, a_i)$, and $\nu(a_{i+1}) >
\nu(a_i)$. It can be shown that this sequence eventually reaches $a_{i+1} = 0$ for some
index $i$. Assuming $\nu(a) = 0$, then $\gcd(a, b)$ is the odd part of $a_i$. Indeed, in
Algorithm BinaryDivide, if some odd prime divides both $a$ and $b$, it certainly
divides $2^{-j}b$, which is an integer, and thus it divides $a + q2^{-j}b$. Conversely, if
some odd prime divides both $b$ and $r$, it divides also $2^{-j}b$, and thus it divides
$a = r - q2^{-j}b$; this shows that no spurious factor appears, unlike in some other
gcd algorithms.


EXAMPLE: let $a = a_0 = 935$ and $b = a_1 = 714$, so $\nu(b) = \nu(a) + 1$.
Algorithm BinaryDivide computes $b' = 357$, $q = 1$, and $a_2 = a + q2^{-j}b =
1292$. The next step gives $a_3 = 1360$, then $a_4 = 1632$, $a_5 = 2176$,
$a_6 = 0$. Since $2176 = 2^7 \cdot 17$, we conclude that the gcd of 935 and 714 is
17. Note that the binary remainder sequence might contain negative terms and
terms larger than $a, b$. For example, starting from $a = 19$ and $b = 2$, we get
$19, 2, 20, -8, 16, 0$.
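The two sequences of this example can be reproduced with a short Python sketch of BinaryDivide. One detail is an interpretation made for the example: when $\nu(a) > 0$, the quotient $-a/b' \bmod 2^{j+1}$ is computed after removing the common factor $2^{\nu(a)}$ from $a$ and $b'$. The helper names and the use of `pow(x, -1, m)` (Python 3.8 or later) are likewise assumptions.

```python
def valuation(x: int) -> int:
    """2-adic valuation of a nonzero integer x."""
    return (x & -x).bit_length() - 1


def binary_divide(a: int, b: int) -> tuple[int, int]:
    """Binary quotient q and remainder r = a + q * b / 2**j, with j = v(b) - v(a) > 0."""
    va = valuation(a)
    j = valuation(b) - va
    if j <= 0:
        raise ValueError("need v(b) > v(a)")
    b1 = b >> j                          # b' = b / 2**j, exact
    mod = 1 << (j + 1)
    ao, bo = a >> va, b1 >> va           # odd parts of a and b'
    q = (-ao * pow(bo % mod, -1, mod)) % mod
    if q >= 1 << j:
        q -= mod                         # centered quotient, |q| < 2**j
    return q, a + q * b1


def binary_remainder_sequence(a: int, b: int) -> list[int]:
    """Terms a0 = a, a1 = b, ... up to and including the first zero."""
    seq = [a, b]
    while seq[-1] != 0:
        _, r = binary_divide(seq[-2], seq[-1])
        seq.append(r)
    return seq


print(binary_remainder_sequence(935, 714))  # [935, 714, 1292, 1360, 1632, 2176, 0]
print(binary_remainder_sequence(19, 2))     # [19, 2, 20, -8, 16, 0]
```

The gcd is then the odd part of the last nonzero term, here `2176 >> valuation(2176)`, which is 17.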


An asymptotically fast GCD algorithm with complexity $O(M(n) \log n)$ can
be constructed with Algorithm HalfBinaryGcd.


Theorem 1.9 Given $a, b \in \mathbb{Z}$ with $\nu(a) = 0$ and $\nu(b) > 0$, and an integer
$k \ge 0$, Algorithm HalfBinaryGcd returns an integer $0 \le j \le k$ and a matrix
$R$ such that, if $c = 2^{-2j}(R_{1,1}a + R_{1,2}b)$ and $d = 2^{-2j}(R_{2,1}a + R_{2,2}b)$:

1. $c$ and $d$ are integers with $\nu(c) = 0$ and $\nu(d) > 0$;
2. $c^* = 2^j c$ and $d^* = 2^j d$ are two consecutive terms from the binary remainder
sequence of $a, b$ with $\nu(c^*) \le k < \nu(d^*)$.


Algorithm 1.22 HalfBinaryGcd
Input: $a, b \in \mathbb{Z}$ with $0 = \nu(a) < \nu(b)$, a non-negative integer $k$
Output: an integer $j$ and a $2 \times 2$ matrix $R$ satisfying Theorem 1.9
1: if $\nu(b) > k$ then
2:     return $0$, $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
3: $k_1 \leftarrow \lfloor k/2 \rfloor$
4: $a_1 \leftarrow a \bmod 2^{2k_1+1}$, $b_1 \leftarrow b \bmod 2^{2k_1+1}$
5: $j_1, R \leftarrow$ HalfBinaryGcd$(a_1, b_1, k_1)$
6: $a' \leftarrow 2^{-2j_1}(R_{1,1}a + R_{1,2}b)$, $b' \leftarrow 2^{-2j_1}(R_{2,1}a + R_{2,2}b)$
7: $j_0 \leftarrow \nu(b')$
8: if $j_0 + j_1 > k$ then
9:     return $j_1, R$
10: $q, r \leftarrow$ BinaryDivide$(a', b')$
11: $k_2 \leftarrow k - (j_0 + j_1)$
12: $a_2 \leftarrow b'/2^{j_0} \bmod 2^{2k_2+1}$, $b_2 \leftarrow r/2^{j_0} \bmod 2^{2k_2+1}$
13: $j_2, S \leftarrow$ HalfBinaryGcd$(a_2, b_2, k_2)$
14: return $j_1 + j_0 + j_2$, $S \times \begin{pmatrix} 0 & 2^{j_0} \\ 2^{j_0} & q \end{pmatrix} \times R$.


Proof. We prove the theorem by induction on $k$. If $k = 0$, the algorithm returns
$j = 0$ and the identity matrix, thus we have $c = a$ and $d = b$, and the
statement is true. Now suppose $k > 0$, and assume that the theorem is true up
to $k - 1$.

The first recursive call uses $k_1 < k$, since $k_1 = \lfloor k/2 \rfloor < k$. After step 5, by
induction, $a_1' = 2^{-2j_1}(R_{1,1}a_1 + R_{1,2}b_1)$ and $b_1' = 2^{-2j_1}(R_{2,1}a_1 + R_{2,2}b_1)$ are
integers with $\nu(a_1') = 0 < \nu(b_1')$, and $2^{j_1}a_1'$, $2^{j_1}b_1'$ are two consecutive terms
from the binary remainder sequence of $a_1, b_1$. Lemma 7 of [208] says that the
quotients of the remainder sequence of $a, b$ coincide with those of $a_1, b_1$ up to
$2^{j_1}a'$ and $2^{j_1}b'$. This proves that $2^{j_1}a'$, $2^{j_1}b'$ are two consecutive terms of the
remainder sequence of $a, b$. Since $a$ and $a_1$ differ by a multiple of $2^{2k_1+1}$, $a'$
and $a_1'$ differ by a multiple of $2^{2k_1+1-2j_1} \ge 2$ since $j_1 \le k_1$ by induction. It
follows that $\nu(a') = 0$. Similarly, $b'$ and $b_1'$ differ by a multiple of 2, and thus
$j_0 = \nu(b') > 0$.

The second recursive call uses $k_2 < k$, since by induction $j_1 \ge 0$ and we
just showed $j_0 > 0$. It easily follows that $j_1 + j_0 + j_2 > 0$, and thus $j \ge 0$. If
we exit at step 9, we have $j = j_1 \le k_1 < k$. Otherwise $j = j_1 + j_0 + j_2 =
k - k_2 + j_2 \le k$ by induction.

If $j_0 + j_1 > k$, we have $\nu(2^{j_1}b') = j_0 + j_1 > k$, we exit the algorithm, and
the statement holds. Now assume $j_0 + j_1 \le k$. We compute an extra term $r$
of the remainder sequence from $a', b'$, which, up to multiplication by $2^{j_1}$, is an
extra term of the remainder sequence of $a, b$. Since $r = a' + q2^{-j_0}b'$, we have

$$\begin{pmatrix} b' \\ r \end{pmatrix} = 2^{-j_0} \begin{pmatrix} 0 & 2^{j_0} \\ 2^{j_0} & q \end{pmatrix} \begin{pmatrix} a' \\ b' \end{pmatrix}.$$

The new terms of the remainder sequence are $b'/2^{j_0}$ and $r/2^{j_0}$, adjusted so that
$\nu(b'/2^{j_0}) = 0$. The same argument as above holds for the second recursive
call, which stops when the 2-valuation of the sequence starting from $a_2, b_2$
exceeds $k_2$; this corresponds to a 2-valuation larger than $j_0 + j_1 + k_2 = k$ for
the $a, b$ remainder sequence.


Given two n-bit integers a and b, and k = n/2, HalfBinaryGcd yields two 
consecutive elements c*,d* of their binary remainder sequence with bit-size 
about n/2 (for their odd part). 


EXAMPLE: let $a = 1\,889\,826\,700\,059$ and $b = 421\,872\,857\,844$, with $k = 20$.
The first recursive call with $a_1 = 1\,243\,931$, $b_1 = 1\,372\,916$, $k_1 = 10$ gives
$j_1 = 8$ and $R = \begin{pmatrix} 352 & 280 \\ 260 & 393 \end{pmatrix}$, which corresponds to $a' = 11\,952\,871\,683$
and $b' = 10\,027\,328\,112$, with $j_0 = 4$. The binary division yields the new
term $r = 8\,819\,331\,648$, and we have $k_2 = 8$, $a_2 = 52\,775$, $b_2 = 50\,468$.
The second recursive call gives $j_2 = 8$ and $S = \begin{pmatrix} 64 & 272 \\ 212 & -123 \end{pmatrix}$, which finally
gives $j = 20$ and the matrix $\begin{pmatrix} 1\,444\,544 & 1\,086\,512 \\ 349\,084 & 1\,023\,711 \end{pmatrix}$, which corresponds to the
remainder terms $r_8 = 2\,899\,749 \cdot 2^j$, $r_9 = 992\,790 \cdot 2^j$. With the same $a, b$
values, but with $k = 41$, which corresponds to the bit-size of $a$, we get as
final values of the algorithm $r_{15} = 3 \cdot 2^{41}$ and $r_{16} = 0$, which proves that
$\gcd(a, b) = 3$.

Let $H(n)$ be the complexity of HalfBinaryGcd for inputs of $n$ bits and
$k = n/2$; $a_1$ and $b_1$ have $\sim n/2$ bits, the coefficients of $R$ have $\sim n/4$ bits, and
$a', b'$ have $\sim 3n/4$ bits. The remainders $a_2, b_2$ have $\sim n/2$ bits, the coefficients
of $S$ have $\sim n/4$ bits, and the final values $c, d$ have $\sim n/2$ bits. The main costs
are the matrix-vector product at step 6, and the final matrix-matrix product.
We obtain $H(n) \sim 2H(n/2) + 4M(n/4, n) + 7M(n/4)$, assuming we use
Strassen's algorithm to multiply two $2 \times 2$ matrices with 7 scalar products, i.e.
$H(n) \sim 2H(n/2) + 17M(n/4)$, assuming that we compute each $M(n/4, n)$
product with a single FFT transform of width $5n/4$, which gives cost about
$M(5n/8) \sim 0.625M(n)$ in the FFT range. Thus, $H(n) = O(M(n) \log n)$.

For the plain gcd, we call HalfBinaryGcd with $k = n$, and instead of computing
the final matrix product, we multiply $2^{-2j_2}S$ by $(b', r)$ (the components
have $\sim n/2$ bits) to obtain the final $c, d$ values. The first recursive call
has $a_1, b_1$ of size $n$ with $k_1 \approx n/2$, and corresponds to $H(n)$; the
matrix $R$ and $a', b'$ have $n/2$ bits, and $k_2 \approx n/2$, and thus the second recursive
call corresponds to a plain gcd of size $n/2$. The cost $G(n)$ satisfies $G(n) =
H(n) + G(n/2) + 4M(n/2, n) + 4M(n/2) \sim H(n) + G(n/2) + 10M(n/2)$.
Thus, $G(n) = O(M(n) \log n)$.
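The arithmetic of the HalfBinaryGcd example with $k = 20$ can be checked independently of the algorithm itself. The sketch below does not implement HalfBinaryGcd; it only takes the intermediate values quoted in the example ($j_1$, $R$, $j_0$, $r$, $S$) as given and verifies that they fit together as steps 6, 10 and 14 prescribe. The helper `matmul` is an assumption of the example.

```python
from math import gcd


def matmul(X, Y):
    """Product of two 2 x 2 integer matrices given as nested tuples."""
    return tuple(
        tuple(sum(X[i][t] * Y[t][j] for t in range(2)) for j in range(2))
        for i in range(2)
    )


a, b = 1889826700059, 421872857844
j1, R = 8, ((352, 280), (260, 393))

# Step 6: a' and b' from the first recursive call.
a_p = (R[0][0] * a + R[0][1] * b) >> (2 * j1)
b_p = (R[1][0] * a + R[1][1] * b) >> (2 * j1)
print(a_p, b_p)                      # 11952871683 10027328112

# Step 10: the binary quotient q consistent with r = a' + q * b' / 2**j0.
j0, r = 4, 8819331648
q = (r - a_p) // (b_p >> j0)
print(q)                             # -5

# Step 14: final matrix S x [[0, 2**j0], [2**j0, q]] x R, with j = j1 + j0 + j2.
j2, S = 8, ((64, 272), (212, -123))
M = matmul(matmul(S, ((0, 1 << j0), (1 << j0, q))), R)
j = j1 + j0 + j2
c = (M[0][0] * a + M[0][1] * b) >> (2 * j)
d = (M[1][0] * a + M[1][1] * b) >> (2 * j)
print(j, M)                          # 20 ((1444544, 1086512), (349084, 1023711))
print(c, d, gcd(a, b))               # 2899749 992790 3
```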

An application of the half-gcd per se in the MSB case is the rational reconstruction
problem. Assume we want to compute a rational $p/q$, where $p$ and $q$
are known to be bounded by some constant $c$. Instead of computing with rationals,
we may perform all computations modulo some integer $n > c^2$. Hence,
we will end up with $p/q = m \bmod n$, and the problem is now to find the unknown
$p$ and $q$ from the known integer $m$. To do this, we start an extended
gcd from $m$ and $n$, and we stop as soon as the current $a$ and $u$ values (as in
ExtendedGcd) are smaller than $c$: since we have $a = um + vn$, this gives
$m = a/u \bmod n$. This is exactly what is called a half-gcd; a subquadratic
version in the LSB case is given above.
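Rational reconstruction can be sketched with the quadratic extended Euclidean algorithm. The version below is a small variant of the description above: it stops as soon as the remainder drops below $c$ and then checks the cofactor, returning `None` if the cofactor is too large. The function name, the sign normalization and the sample modulus 1009 are assumptions of the example.

```python
def rational_reconstruct(m: int, n: int, c: int):
    """Find (p, q) with p/q = m mod n, |p| < c and 0 < q < c, or None."""
    a, b = n, m % n          # remainders
    u, w = 0, 1              # cofactors of m: a = u*m mod n, b = w*m mod n
    while b >= c:
        quo = a // b
        a, b = b, a - quo * b
        u, w = w, u - quo * w
    if w == 0 or abs(w) >= c:
        return None
    if w < 0:                # normalize to a positive denominator
        b, w = -b, -w
    return b, w


n, c = 1009, 10
m = 2 * pow(3, -1, n) % n                # image of 2/3 modulo n
print(rational_reconstruct(m, n, c))     # (2, 3)
m = -3 * pow(7, -1, n) % n               # image of -3/7 modulo n
print(rational_reconstruct(m, n, c))     # (-3, 7)
```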


1.7 Base conversion 


Since computers usually work with binary numbers, and humans prefer decimal
representations, input/output base conversions are needed. In a typical computation,
there are only a few conversions, compared to the total number of
operations, so optimizing conversions is less important than optimizing other
aspects of the computation. However, when working with huge numbers, naive
conversion algorithms may slow down the whole computation.

In this section, we consider that numbers are represented internally in base
$\beta$ (usually a power of 2) and externally in base $B$ (say a power of ten). When
both bases are commensurable, i.e. both are powers of a common integer, such
as $\beta = 8$ and $B = 16$, conversions of $n$-digit numbers can be performed
in $O(n)$ operations. We assume here that $\beta$ and $B$ are not commensurable.
We might think that only one algorithm is needed, since input and output are
symmetric by exchanging bases $\beta$ and $B$. Unfortunately, this is not true, since
computations are done only in base $\beta$ (see Exercise 1.37).


1.7.1 Quadratic algorithms 


Algorithms IntegerInput and IntegerOutput, respectively, read and write
$n$-word integers, both with a complexity of $O(n^2)$.

Algorithm 1.23 IntegerInput
Input: a string $S = s_{m-1} \ldots s_1 s_0$ of digits in base $B$
Output: the value $A$ in base $\beta$ of the integer represented by $S$
$A \leftarrow 0$
for $i$ from $m - 1$ downto 0 do
    $A \leftarrow BA + \mathrm{val}(s_i)$    ▷ $\mathrm{val}(s_i)$ is the value of $s_i$ in base $\beta$
return $A$.


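Algorithm 1.24 IntegerOutput
Input: $A = \sum_{i=0}^{n-1} a_i \beta^i > 0$
Output: a string $S$ of characters, representing $A$ in base $B$
$m \leftarrow 0$
while $A \ne 0$ do
    $s_m \leftarrow \mathrm{char}(A \bmod B)$    ▷ $s_m$: character corresponding to $A \bmod B$
    $A \leftarrow A$ div $B$
    $m \leftarrow m + 1$
return $S = s_{m-1} \ldots s_1 s_0$.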
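The two quadratic routines can be sketched in Python, where the built-in integer type plays the role of the internal base-$\beta$ representation. The digit alphabet `DIGITS`, the function names, and the convention of returning "0" for a zero input are choices made for this example.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def integer_input(s: str, B: int = 10) -> int:
    """Algorithm IntegerInput: value of the digit string s written in base B."""
    A = 0
    for ch in s:                     # from s_{m-1} down to s_0
        A = B * A + DIGITS.index(ch)
    return A


def integer_output(A: int, B: int = 10) -> str:
    """Algorithm IntegerOutput: digits of A >= 0 in base B, most significant first."""
    if A == 0:
        return "0"
    chars = []
    while A != 0:
        A, d = divmod(A, B)          # d = A mod B, then A <- A div B
        chars.append(DIGITS[d])      # s_0 first, s_{m-1} last
    return "".join(reversed(chars))


print(integer_input("427419669081"))  # 427419669081
print(integer_output(255, 16))        # ff
```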


1.7.2 Subquadratic algorithms 


Fast conversion routines are obtained using a "divide and conquer" strategy.
Given two strings $s$ and $t$, we let $s \,\|\, t$ denote the concatenation of $s$ and $t$. For
integer input, if the given string decomposes as $S = S_{\mathrm{hi}} \,\|\, S_{\mathrm{lo}}$, where $S_{\mathrm{lo}}$ has
$k$ digits in base $B$, then

$$\mathrm{Input}(S, B) = \mathrm{Input}(S_{\mathrm{hi}}, B)B^k + \mathrm{Input}(S_{\mathrm{lo}}, B),$$

where $\mathrm{Input}(S, B)$ is the value obtained when reading the string $S$ in the
external base $B$. Algorithm FastIntegerInput shows one way to implement
this: if the output $A$ has $n$ words, Algorithm FastIntegerInput has complexity
$O(M(n) \log n)$, more precisely $\sim M(n/4) \lg n$ for $n$ a power of two in the
FFT range (see Exercise 1.34).

For integer output, a similar algorithm can be designed, replacing multiplications
by divisions. Namely, if $A = A_{\mathrm{hi}}B^k + A_{\mathrm{lo}}$, then

$$\mathrm{Output}(A, B) = \mathrm{Output}(A_{\mathrm{hi}}, B) \,\|\, \mathrm{Output}(A_{\mathrm{lo}}, B),$$

where $\mathrm{Output}(A, B)$ is the string resulting from writing the integer $A$ in the
external base $B$, and it is assumed that $\mathrm{Output}(A_{\mathrm{lo}}, B)$ has exactly $k$ digits,
after possibly padding with leading zeros.

Algorithm 1.25 FastIntegerInput
Input: a string $S = s_{m-1} \ldots s_1 s_0$ of digits in base $B$
Output: the value $A$ of the integer represented by $S$
$\ell \leftarrow [\mathrm{val}(s_0), \mathrm{val}(s_1), \ldots, \mathrm{val}(s_{m-1})]$
$(b, k) \leftarrow (B, m)$    ▷ Invariant: $\ell$ has $k$ elements $\ell_0, \ldots, \ell_{k-1}$
while $k > 1$ do
    if $k$ even then $\ell \leftarrow [\ell_0 + b\ell_1, \ell_2 + b\ell_3, \ldots, \ell_{k-2} + b\ell_{k-1}]$
    else $\ell \leftarrow [\ell_0 + b\ell_1, \ell_2 + b\ell_3, \ldots, \ell_{k-1}]$
    $(b, k) \leftarrow (b^2, \lceil k/2 \rceil)$
return $\ell_0$.


Algorithm 1.26 FastIntegerOutput
Input: $A = \sum_{i=0}^{n-1} a_i \beta^i$
Output: a string $S$ of characters, representing $A$ in base $B$
if $A < B$ then
    return $\mathrm{char}(A)$
else
    find $k$ such that $B^{2k-2} \le A < B^{2k}$
    $(Q, R) \leftarrow$ DivRem$(A, B^k)$
    $r \leftarrow$ FastIntegerOutput$(R)$
    return FastIntegerOutput$(Q) \,\|\, 0^{k - \mathrm{len}(r)} \,\|\, r$.


If the input $A$ has $n$ words, Algorithm FastIntegerOutput has complexity
$O(M(n) \log n)$, more precisely $\sim D(n/4) \lg n$ for $n$ a power of two in the
FFT range, where $D(n)$ is the cost of dividing a $2n$-word integer by an $n$-word
integer. Depending on the cost ratio between multiplication and division,
integer output may thus be from two to five times slower than integer input;
see however Exercise 1.35.
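The two divide-and-conquer routines can be sketched in Python as follows. The code mirrors the structure of FastIntegerInput and FastIntegerOutput but makes no attempt at efficiency: in particular, $k$ is found by a naive search and the powers of $B$ are recomputed at each level, whereas a serious implementation would precompute them. The names and the digit alphabet are assumptions of the example.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"


def fast_integer_input(s: str, B: int = 10) -> int:
    """FastIntegerInput: combine neighbouring pieces, squaring the base each round."""
    l = [DIGITS.index(ch) for ch in reversed(s)]   # l[0] = val(s_0)
    b, k = B, len(l)
    while k > 1:
        pairs = [l[i] + b * l[i + 1] for i in range(0, k - 1, 2)]
        if k % 2 == 1:
            pairs.append(l[k - 1])                 # odd k: last piece is kept as is
        l = pairs
        b, k = b * b, (k + 1) // 2
    return l[0]


def fast_integer_output(A: int, B: int = 10) -> str:
    """FastIntegerOutput: split A = Q * B**k + R and recurse on both halves."""
    if A < B:
        return DIGITS[A]
    k = 1
    while B ** (2 * k) <= A:                       # B**(2k-2) <= A < B**(2k)
        k += 1
    Q, R = divmod(A, B ** k)
    r = fast_integer_output(R, B)
    return fast_integer_output(Q, B) + "0" * (k - len(r)) + r


x = 10**50 + 12345
print(fast_integer_output(x) == str(x))        # True
print(fast_integer_input(str(x)) == x)         # True
```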


1.8 Exercises 


Exercise 1.1 Extend the Kronecker–Schönhage trick mentioned at the beginning
of §1.3 to negative coefficients, assuming the coefficients are in the range
$[-p, p]$.


Exercise 1.2 (Harvey [114]) For multiplying two polynomials of degree less
than $n$, with non-negative integer coefficients bounded above by $p$, the
Kronecker–Schönhage trick performs one integer multiplication of size about
$2n \lg p$, assuming $n$ is small compared to $p$. Show that it is possible to perform
two integer multiplications of size $n \lg p$ instead, and even four integer
multiplications of size $(n/2) \lg p$.


Exercise 1.3 Assume your processor provides an instruction fmaa$(a, b, c, d)$
returning $h, \ell$ such that $ab + c + d = h\beta + \ell$, where $0 \le a, b, c, d, \ell, h < \beta$.
Rewrite Algorithm BasecaseMultiply using fmaa.


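Exercise 1.4 (Harvey, Khachatrian et al. [138]) For $A = \sum_{i=0}^{n-1} a_i \beta^i$ and
$B = \sum_{j=0}^{n-1} b_j \beta^j$, prove the formula

$$AB = \sum_{i=1}^{n-1} \sum_{j=0}^{i-1} (a_i + a_j)(b_i + b_j)\beta^{i+j} + 2\sum_{i=0}^{n-1} a_i b_i \beta^{2i} - \sum_{i=0}^{n-1} \beta^i \sum_{j=0}^{n-1} a_j b_j \beta^j.$$

Deduce a new algorithm for schoolbook multiplication.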


Exercise 1.5 (Hanrot) Prove that the number $K(n)$ of word-products (as defined
in the proof of Thm. 1.2) in Karatsuba's algorithm is non-decreasing,
provided $n_0 = 2$. Plot the graph of $K(n)/n^{\lg 3}$ with a logarithmic scale for $n$,
for $2^7 \le n \le 2^{10}$, and find experimentally where the maximum appears.


Exercise 1.6 (Ryde) Assume the basecase multiply costs $M(n) = an^2 + bn$,
and that Karatsuba's algorithm costs $K(n) = 3K(n/2) + cn$. Show that dividing
$a$ by two increases the Karatsuba threshold $n_0$ by a factor of two, and on
the contrary decreasing $b$ and $c$ decreases $n_0$.


Exercise 1.7 (Maeder [157], Thomé [215]) Show that an auxiliary memory
of $2n + o(n)$ words is enough to implement Karatsuba's algorithm in-place,
for an $n$-word $\times$ $n$-word product. In the polynomial case, prove that an auxiliary
space of $n$ coefficients is enough, in addition to the $n + n$ coefficients of the
input polynomials, and the $2n - 1$ coefficients of the product. [You can use the
$2n$ result words, but must not destroy the $n + n$ input words.]


Exercise 1.8 (Roche [190]) If Exercise 1.7 was too easy for you, design a
Karatsuba-like algorithm using only $O(\log n)$ extra space (you are allowed to
read and write in the $2n$ output words, but the $n + n$ input words are read-only).


Exercise 1.9 (Quercia, McLaughlin) Modify Algorithm KaratsubaMultiply
to use only $\sim 7n/2$ additions/subtractions. [Hint: decompose each of $C_0$, $C_1$
and $C_2$ into two parts.]


Exercise 1.10 Design an in-place version of KaratsubaMultiply (see Exercise
1.7) that accumulates the result in $c_0, \ldots, c_{n-1}$, and returns a carry bit.
Exercise 1.11 (Vuillemin) Design an algorithm to multiply $a_2x^2 + a_1x + a_0$ by
$b_1x + b_0$ using 4 multiplications. Can you extend it to a $6 \times 6$ product using 16
multiplications?


Exercise 1.12 (Weimerskirch, Paar) Extend the Karatsuba trick to compute
an $n \times n$ product in $n(n + 1)/2$ multiplications. For which $n$ does this win
over the classical Karatsuba algorithm?


Exercise 1.13 (Hanrot) In Algorithm OddEvenKaratsuba, if both $m$ and $n$
are odd, we combine the larger parts $A_0$ and $B_0$ together, and the smaller parts
$A_1$ and $B_1$ together. Find a way to get instead

$$K(m, n) = K(\lceil m/2 \rceil, \lfloor n/2 \rfloor) + K(\lfloor m/2 \rfloor, \lceil n/2 \rceil) + K(\lceil m/2 \rceil, \lceil n/2 \rceil).$$


Exercise 1.14 Prove that if five integer evaluation points are used for Toom–Cook
3-way (§1.3.3), the division by (a multiple of) three cannot be avoided.
Does this remain true if only four integer points are used together with $\infty$?


Exercise 1.15 (Quercia, Harvey) In Toom–Cook 3-way (§1.3.3), take as evaluation
point $2^w$ instead of 2, where $w$ is the number of bits per word (usually
$w = 32$ or 64). Which division is then needed? Similarly for the evaluation
point $2^{w/2}$.


Exercise 1.16 For an integer $k \ge 2$ and multiplication of two numbers of size
$kn$ and $n$, show that the trivial strategy, which performs $k$ multiplications, each
$n \times n$, is not the best possible in the FFT range.


Exercise 1.17 (Karatsuba, Zuras [235]) Assuming the multiplication has 
superlinear cost, show that the speedup of squaring with respect to multipli- 
cation can not significantly exceed 2. 


Exercise 1.18 (Thomé, Quercia) Consider two sets $A = \{a, b, c, \ldots\}$ and
$U = \{u, v, w, \ldots\}$, and a set $X = \{x, y, z, \ldots\}$ of sums of products of elements
of $A$ and $U$ (assumed to be in some field $F$). We can ask "what is
the least number of multiplies required to compute all elements of $X$?". In
general, this is a difficult problem, related to the problem of computing tensor
rank, which is NP-complete (see for example Håstad [119] and the book by
Bürgisser et al. [60]). Special cases include integer/polynomial multiplication,
the middle product, and matrix multiplication (for matrices of fixed size). As a
specific example, can we compute $x = au + cw$, $y = av + bw$, $z = bu + cv$ in
fewer than six multiplies? Similarly for $x = au - cw$, $y = av - bw$, $z = bu - cv$.


Exercise 1.19 In Algorithm BasecaseDivRem (§1.4.1), prove that $q_j^* \le \beta + 1$.
Can this bound be reached? In the case $q_j^* \ge \beta$, prove that the while-loop at
steps 6-8 is executed at most once. Prove that the same holds for Svoboda's
algorithm, i.e. that $A \ge 0$ after step 8 of Algorithm SvobodaDivision (§1.4.2).


Exercise 1.20 (Granlund, Möller) In Algorithm BasecaseDivRem, estimate
the probability that $A < 0$ is true at step 6, assuming the remainder $r_j$ from the
division of $a_{n+j}\beta + a_{n+j-1}$ by $b_{n-1}$ is uniformly distributed in $[0, b_{n-1} - 1]$,
$A \bmod \beta^{n+j-1}$ is uniformly distributed in $[0, \beta^{n+j-1} - 1]$, and $B \bmod \beta^{n-1}$
is uniformly distributed in $[0, \beta^{n-1} - 1]$. Then replace the computation of $q_j^*$ by
a division of the three most significant words of $A$ by the two most significant
words of $B$. Prove the algorithm is still correct. What is the maximal number
of corrections, and the probability that $A < 0$?


Exercise 1.21 (Montgomery [171]) Let $0 < b < \beta$, and $0 \le a_4, \ldots, a_0 < \beta$.
Prove that $a_4(\beta^4 \bmod b) + \cdots + a_1(\beta \bmod b) + a_0 < \beta^2$, provided $b < \beta/3$.
Use this fact to design an efficient algorithm dividing $A = a_{n-1}\beta^{n-1} + \cdots + a_0$
by $b$. Does the algorithm extend to division by the least significant digits?


Exercise 1.22 In Algorithm RecursiveDivRem, find inputs that require 1, 2, 3
or 4 corrections in step 8. [Hint: consider $\beta = 2$.] Prove that when $n = m$ and
$A < \beta^m(B + 1)$, at most two corrections occur.


Exercise 1.23 Find the complexity of Algorithm RecursiveDivRem in the 
FFT range. 


Exercise 1.24 Consider the division of $A$ of $kn$ words by $B$ of $n$ words, with
integer $k \ge 3$, and the alternate strategy that consists of extending the divisor
with zeros so that it has half the size of the dividend. Show that this is always
slower than Algorithm UnbalancedDivision (assuming that division has
superlinear cost).


Exercise 1.25 An important special case of division is when the divisor is of
the form $b^k$. For example, this is useful for an integer output routine (§1.7).
Can a fast algorithm be designed for this case?


Exercise 1.26 (Sedoglavic) Does the Kronecker–Schönhage trick to reduce
polynomial multiplication to integer multiplication (§1.3) also work, in an
efficient way, for division? Assume that you want to divide a degree-$2n$ polynomial
$A(x)$ by a monic degree-$n$ polynomial $B(x)$, both polynomials having
integer coefficients bounded by $p$.


Exercise 1.27 Design an algorithm that performs an exact division of a $4n$-bit
integer by a $2n$-bit integer, with a quotient of $2n$ bits, using the idea mentioned
in the last paragraph of §1.4.5. Prove that your algorithm is correct.
Exercise 1.28 Improve the initial speed of convergence of Algorithm SqrtInt
(§1.5.1) by using a better starting approximation at step 1. Your approximation
should be in the interval $[\lfloor \sqrt{m} \rfloor, \lceil 2\sqrt{m} \rceil]$.


Exercise 1.29 (Luschny) Devise a fast algorithm for computing the binomial
coefficient

$$C(n, k) = \binom{n}{k} = \frac{n!}{k!(n - k)!}$$

for integers $n, k$, $0 \le k \le n$. The algorithm should use exact integer arithmetic
and compute the exact answer.


Exercise 1.30 (Shoup) Show that in Algorithm ExtendedGcd, if $a \ge b > 0$,
and $g = \gcd(a, b)$, then the cofactor $u$ satisfies $-b/(2g) < u \le b/(2g)$.


Exercise 1.31 (a) Devise a subquadratic GCD algorithm HalfGcd along the
lines outlined in the first three paragraphs of §1.6.3 (most-significant bits first).
The input is two integers $a \ge b > 0$. The output is a $2 \times 2$ matrix $R$ and
integers $a', b'$ such that $[a'\ b']^t = R[a\ b]^t$. If the inputs have size $n$ bits, then the
elements of $R$ should have at most $n/2 + O(1)$ bits, and the outputs $a', b'$ should
have at most $3n/4 + O(1)$ bits. (b) Construct a plain GCD algorithm which
calls HalfGcd until the arguments are small enough to call a naive algorithm.
(c) Compare this approach with the use of HalfBinaryGcd in §1.6.3.


Exercise 1.32 (Galbraith, Schönhage, Stehlé) The Jacobi symbol $(a|b)$ of an
integer $a$ and a positive odd integer $b$ satisfies $(a|b) = (a \bmod b\,|\,b)$, the law
of quadratic reciprocity $(a|b)(b|a) = (-1)^{(a-1)(b-1)/4}$ for $a$ odd and positive,
together with $(-1|b) = (-1)^{(b-1)/2}$, and $(2|b) = (-1)^{(b^2-1)/8}$. This
looks very much like the gcd recurrence: $\gcd(a, b) = \gcd(a \bmod b, b)$ and
$\gcd(a, b) = \gcd(b, a)$. Design an $O(M(n) \log n)$ algorithm to compute the
Jacobi symbol of two $n$-bit integers.


Exercise 1.33 Show that $B$ and $\beta$ are commensurable, in the sense defined in
§1.7, iff $\ln(B)/\ln(\beta) \in \mathbb{Q}$.


Exercise 1.34 Find a formula $T(n)$ for the asymptotic complexity of Algorithm
FastIntegerInput when $n = 2^k$ (§1.7.2). Show that, for general $n$, the
formula is within a factor of two of $T(n)$. [Hint: consider the binary expansion
of $n$.]


Exercise 1.35 Show that the integer output routine can be made as fast
(asymptotically) as the integer input routine FastIntegerInput. Do timing
experiments with your favorite multiple-precision software. [Hint: use D. Bernstein's
scaled remainder tree [21] and the middle product.]


Exercise 1.36 If the internal base $\beta$ and the external base $B$ share a nontrivial
common divisor (as in the case $\beta = 2^\ell$ and $B = 10$), show how we can
exploit this to speed up the subquadratic input and output routines.


Exercise 1.37 Assume you are given two n-digit integers in base ten, but you 
have implemented fast arithmetic only in base two. Can you multiply the inte- 
gers in time O(M(n))? 


1.9 Notes and references


“On-line” (as opposed to “off-line”) algorithms are considered in many books
and papers; see for example the book by Borodin and El-Yaniv [33].


“Relaxed” algorithms were introduced by van der Hoeven. For references and
a discussion of the differences between “lazy”, “zealous”, and “relaxed”
algorithms, see [124].

An example of an implementation with “guard bits” to avoid overflow problems
in integer addition (§1.2) is the block-wise modular arithmetic of Lenstra
and Dixon on the MasPar [87]. They used $\beta = 2^{30}$ with 32-bit words.

The observation that polynomial multiplication reduces to integer multiplication
is due to both Kronecker and Schönhage, which explains the name
“Kronecker–Schönhage trick”. More precisely, Kronecker [146, pp. 941–942]
(also [147, §4]) reduced the irreducibility test for factorization of multivariate
polynomials to the univariate case, and Schönhage [196] reduced the univariate
case to the integer case. The Kronecker–Schönhage trick is improved in
Harvey [114] (see Exercise 1.2), and some nice applications of it are given in
Steel [206].

Karatsuba's algorithm was first published in [136]. Very little is known about
its average complexity. What is clear is that no simple asymptotic equivalent
can be obtained, since the ratio $K(n)/n^{\alpha}$ does not converge (see Exercise 1.5).

Andrei Toom [217] discovered the class of Toom–Cook algorithms, and they
were discussed by Stephen Cook in his thesis [76, pp. 51–77]. A very good
description of these algorithms can be found in the book by Crandall and
Pomerance [81, §9.5.1]. In particular, it describes how to generate the evaluation
and interpolation formulæ symbolically. Zuras [235] considers the 4-way and
5-way variants, together with squaring. Bodrato and Zanoni [31] show that the
Toom–Cook 3-way interpolation scheme of §1.3.3 is close to optimal for the
points $0, 1, -1, 2, \infty$; they also exhibit efficient 4-way and 5-way schemes.
Bodrato and Zanoni also introduced the Toom-2.5 and Toom-3.5 notations for
what we call Toom-(3, 2) and Toom-(4, 3), these algorithms being useful for
unbalanced multiplication using a different number of pieces. They noticed
that Toom-(4, 2) only differs from Toom 3-way in the evaluation phase, thus
most of the implementation can be shared.

The Schönhage–Strassen algorithm first appeared in [199], and is described
in §2.3.3. Algorithms using floating-point complex numbers are discussed in
Knuth's classic [142, §4.3.3.C]. See also §3.3.1.

The odd–even scheme is described in Hanrot and Zimmermann [112], and
was independently discovered by Andreas Enge. The asymmetric squaring
formula given in §1.3.6 was invented by Chung and Hasan (see their paper [66]
for other asymmetric formulæ). Exercise 1.4 was suggested by David Harvey,
who independently discovered the algorithm of Khachatrian et al. [138].

See Lefèvre [152] for a comparison of different algorithms for the problem
of multiplication by an integer constant.

Svoboda's algorithm was introduced in [211]. The exact division algorithm
starting from least significant bits is due to Jebelean [130]. Jebelean and
Krandick invented the “bidirectional” algorithm. The Karp–Markstein
trick to speed up Newton's iteration (or Hensel lifting over p-adic numbers)
is described in [137]. The “recursive division” of §1.4.3 is from Burnikel and
Ziegler [61], although earlier but not-so-detailed ideas can be found in
Jebelean [132], and even earlier in Moenck and Borodin [166]. The definition of
Hensel's division used here is due to Shand and Vuillemin [201], who also
point out the duality with Euclidean division.

Algorithm SqrtRem (§1.5.1) was first described in Zimmermann [234], and
proved correct in Bertot et al. [29]. Algorithm SqrtInt is described in Cohen
[73]; its generalization to kth roots (Algorithm RootInt) is due to Keith Briggs.
The detection of exact powers is discussed in Bernstein, Lenstra, and Pila [23]
and earlier in Bernstein [17] and Cohen [73]. It is necessary, for example, in
the AKS primality test of Agrawal, Kayal, and Saxena [2].

The classical (quadratic) Euclidean algorithm has been considered by many
authors; a good reference is Knuth [142]. The Gauss–Kuz'min theorem² gives
the distribution of quotients in the regular continued fraction of almost all real
numbers, and hence is a good guide to the distribution of quotients in the
Euclidean algorithm with large, random inputs. Lehmer's original algorithm is
described in […]. The binary gcd is almost as old as the classical Euclidean
algorithm: Knuth [142] has traced it back to a first-century AD Chinese text
Chiu Chang Suan Shu (see also Mikami [165]). It was rediscovered several
times in the 20th century, and it is usually attributed to Stein [20…]. The
binary gcd has been analysed by Brent [44, 50], Knuth [142], Maze [159], and
Vallée [221]. A parallel (systolic) version that runs in $O(n)$ time using $O(n)$
processors was given by Brent and Kung [53].

² According to the Gauss–Kuz'min theorem [139], the probability of a quotient
$q \in \mathbb{N}^*$ is $\lg(1 + 1/q) - \lg(1 + 1/(q + 1))$.

The double-digit gcd is due to Jebelean [131]. The k-ary gcd reduction is
due to Sorenson [205], and was improved and implemented in GNU MP by
Weber. Weber also invented Algorithm ReducedRatMod [226], inspired by
previous work of Wang.

The first subquadratic gcd algorithm was published by Knuth [141], but his
complexity analysis was suboptimal: he gave $O(n \log^5 n \log\log n)$. The
correct complexity $O(n \log^2 n \log\log n)$ was given by Schönhage [195]; for this
reason the algorithm is sometimes called the Schönhage algorithm.
A description for the polynomial case can be found in Aho, Hopcroft, and
Ullman […], and a detailed (but incorrect) description for the integer case in
Yap [232]. The subquadratic binary gcd given in §1.6.3 is due to Stehlé and
Zimmermann [208]. Möller [168] compares various subquadratic algorithms,
and gives a nice algorithm without “repair steps”.

Several authors mention an $O(n \log^2 n \log\log n)$ algorithm for the computation
of the Jacobi symbol: e.g. see Eikenberry and Sorenson [89] and Shallit and
Sorenson [200]. The earliest reference that we know is a paper by Bach […],
which gives the basic idea (due to Gauss [101, p. 509]). Details are given in
the book by Bach and Shallit [9, Solution of Exercise 5.52], where the
algorithm is said to be “folklore”, with the ideas going back to Bachmann […]
and Gauss. The existence of such an algorithm is mentioned in Schönhage's
book [198, §7.2.3], but without details. See also Brent and Zimmermann [57]
and Exercise 1.32.


2
Modular arithmetic and the FFT 


In this chapter our main topic is modular arithmetic, i.e. how
to compute efficiently modulo a given integer $N$. In most applications,
the modulus $N$ is fixed, and special-purpose algorithms
benefit from some precomputations, depending only on $N$, to
speed up arithmetic modulo $N$.

There is an overlap between Chapter 1 and this chapter. For example,
integer division and modular multiplication are closely related.
In Chapter 1 we present algorithms where no (or only a few)
precomputations with respect to the modulus $N$ are performed. In
this chapter, we consider algorithms which benefit from such
precomputations.

Unless explicitly stated, we consider that the modulus $N$ occupies
$n$ words in the word-base $\beta$, i.e. $\beta^{n-1} \le N < \beta^n$.


2.1 Representation 


We consider in this section the different possible representations of residues
modulo $N$. As in Chapter 1, we consider mainly dense representations.


2.1.1 Classical representation 


The classical representation stores a residue (class) $a$ as an integer $0 \le a < N$.
Residues are thus always fully reduced, i.e. in canonical form.

Another non-redundant form consists in choosing a symmetric representation,
say $-N/2 \le a < N/2$. This form might save some reductions in additions
or subtractions (see §2.2). Negative numbers might be stored either with
a separate sign (sign-magnitude representation) or with a two's-complement
representation.

Since $N$ takes $n$ words in base $\beta$, an alternative redundant representation
chooses $0 \le a < \beta^n$ to represent a residue class. If the underlying arithmetic
is word-based, this will yield no slowdown compared to the canonical form.
An advantage of this representation is that, when adding two residues, it
suffices to compare their sum to $\beta^n$ in order to decide whether the sum has to
be reduced, and the result of this comparison is simply given by the carry bit
of the addition (see Algorithm 1.1 IntegerAddition), instead of by comparing
the sum with $N$. However, in the case that the sum has to be reduced, one or
more further comparisons are needed.


2.1.2 Montgomery’s form 


Montgomery's form is another representation widely used when several modular
operations have to be performed modulo the same integer $N$ (additions,
subtractions, modular multiplications). It implies a small overhead to convert
(if needed) from the classical representation to Montgomery's and vice-versa,
but this overhead is often more than compensated by the speedup obtained in
the modular multiplication.

The main idea is to represent a residue $a$ by $a' = aR \bmod N$, where
$R = \beta^n$, and $N$ takes $n$ words in base $\beta$. Thus Montgomery is not concerned
with the physical representation of a residue class, but with the meaning
associated to a given physical representation. (As a consequence, the different
choices mentioned above for the physical representation are all possible.)
Addition and subtraction are unchanged, but (modular) multiplication translates
to a different, much simpler, algorithm MontgomeryMul (see §2.4.2).

In most applications using Montgomery's form, all inputs are first converted
to Montgomery's form, using $a' = aR \bmod N$, then all computations are
performed in Montgomery's form, and finally all outputs are converted back (if
needed) to the classical form, using $a = a'/R \bmod N$. We need to assume
that $(R, N) = 1$, or equivalently that $(\beta, N) = 1$, to ensure the existence of
$1/R \bmod N$. This is not usually a problem because $\beta$ is a power of two and
$N$ can be assumed to be odd.
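
The conversions above can be sketched in a few lines of Python. This is only an illustration of the definitions, not an efficient implementation: the function names, the word base 2^64, and the modulus 2^127 - 1 are assumptions chosen for the example, and the division by R is done with a modular inverse (three-argument `pow` with exponent -1, Python 3.8 or later) rather than with Algorithm MontgomeryMul of §2.4.2.

```python
def to_montgomery(a, N, R):
    """Map a classical residue a to its Montgomery form a' = a*R mod N."""
    return a * R % N

def from_montgomery(a_m, N, R):
    """Map a' back to the classical residue a = a'/R mod N (needs gcd(R, N) = 1)."""
    return a_m * pow(R, -1, N) % N

def montgomery_product(a_m, b_m, N, R):
    """Return a'*b'/R mod N, which is the Montgomery form of a*b (definition only)."""
    return a_m * b_m * pow(R, -1, N) % N

# Illustrative parameters (assumptions): beta = 2^64, n = 2 words, odd modulus N.
beta, n = 2**64, 2
N = 2**127 - 1            # beta^(n-1) <= N < beta^n
R = beta**n

a, b = 123456789, 987654321
am, bm = to_montgomery(a, N, R), to_montgomery(b, N, R)

# Addition is unchanged in Montgomery form; multiplication needs the extra 1/R.
sum_back = from_montgomery((am + bm) % N, N, R)
prod_back = from_montgomery(montgomery_product(am, bm, N, R), N, R)
```

Here `sum_back` and `prod_back` recover (a + b) mod N and (a * b) mod N, which is the sense in which only the meaning of the stored residue changes.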


2.1.3 Residue number systems


In a residue number system (RNS), a residue $a$ is represented by a list of
residues $a_i$ modulo $N_i$, where the moduli $N_i$ are coprime and their product is
$N$. The integers $a_i$ can be efficiently computed from $a$ using a remainder tree,
and the unique integer $0 \le a < N = N_1 N_2 \cdots$ is computed from the $a_i$ by an
explicit Chinese remainder theorem (§2.7). The residue number system is
interesting since addition and multiplication can be performed in parallel on each
small residue $a_i$. This representation requires that $N$ factors into convenient
moduli $N_1, N_2, \ldots$, which is not always the case (see however §2.9).
Conversion to/from the RNS representation costs $O(M(n) \log n)$, see §2.7.
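
A small Python sketch may make the representation concrete. It assumes illustrative moduli 7, 11, 13, 15 and hypothetical helper names; the conversions are done naively (one reduction per modulus, and the textbook CRT formula) rather than with the remainder tree and fast CRT of §2.7.

```python
from math import prod

def to_rns(a, moduli):
    """List of residues a mod N_i (naive; a remainder tree would be faster)."""
    return [a % m for m in moduli]

def from_rns(residues, moduli):
    """Recover the unique 0 <= a < N = prod(moduli) by the explicit CRT formula."""
    N = prod(moduli)
    a = 0
    for r, m in zip(residues, moduli):
        Ni = N // m                       # product of the other moduli
        a += r * Ni * pow(Ni, -1, m)      # pow(..., -1, m): inverse mod m (Python 3.8+)
    return a % N

moduli = [7, 11, 13, 15]                  # pairwise coprime (assumption), N = 15015
N = prod(moduli)
x, y = 1234, 5678
xr, yr = to_rns(x, moduli), to_rns(y, moduli)

# Addition and multiplication act independently on each small residue.
sum_rns = [(u + v) % m for u, v, m in zip(xr, yr, moduli)]
prod_rns = [(u * v) % m for u, v, m in zip(xr, yr, moduli)]
```

Each component of `sum_rns` and `prod_rns` is computed without reference to the others, which is what allows the operations to run in parallel.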


2.1.4 MSB vs LSB algorithms 


Many classical (most significant bits first or MSB) algorithms have a p-adic
(least significant bits first or LSB) equivalent form. Thus several algorithms in
this chapter are just LSB-variants of algorithms discussed in Chapter 1 (see
Table 2.1 below).


classical (MSB)         p-adic (LSB)
Euclidean division      Hensel division, Montgomery reduction
Svoboda's algorithm     Montgomery–Svoboda
Euclidean gcd           binary gcd
Newton's method         Hensel lifting


Table 2.1 Equivalence between LSB and MSB algorithms.


2.1.5 Link with polynomials 


As in Chapter 1, a strong link exists between modular arithmetic and arithmetic
on polynomials. One way of implementing finite fields $\mathbb{F}_q$ with $q = p^n$
elements is to work with polynomials in $\mathbb{F}_p[x]$, which are reduced modulo a
monic irreducible polynomial $f(x) \in \mathbb{F}_p[x]$ of degree $n$. In this case, modular
reduction happens both at the coefficient level (in $\mathbb{F}_p$) and at the polynomial
level (modulo $f(x)$).

Some algorithms work in the ring $(\mathbb{Z}/N\mathbb{Z})[x]$, where $N$ is a composite
integer. An important case is the Schönhage–Strassen multiplication algorithm,
where $N$ has the form $2^\ell + 1$.

In both domains $\mathbb{F}_p[x]$ and $(\mathbb{Z}/N\mathbb{Z})[x]$, the Kronecker–Schönhage trick
(§1.3) can be applied efficiently. Since the coefficients are known to be bounded,
by $p$ and $N$ respectively, and thus have a fixed size, the segmentation is quite
efficient. If polynomials have degree $d$ and coefficients are bounded by $N$,
the product coefficients are bounded by $dN^2$, and we have $O(M(d \log(Nd)))$
operations, instead of $O(M(d) M(\log N))$ with the classical approach. Also,
the implementation is simpler, because we only have to implement fast
arithmetic for large integers instead of fast arithmetic at both the polynomial
level and the coefficient level (see also Exercises 1.2 and 2.4).


2.2 Modular addition and subtraction 


The addition of two residues in classical representation can be done as in
Algorithm ModularAdd.


Algorithm 2.1 ModularAdd
Input: residues $a$, $b$ with $0 \le a, b < N$
Output: $c = a + b \bmod N$


$c \leftarrow a + b$
if $c \ge N$ then
    $c \leftarrow c - N$.


Assuming that $a$ and $b$ are uniformly distributed in $\mathbb{Z} \cap [0, N - 1]$, the
subtraction $c \leftarrow c - N$ is performed with probability $(1 - 1/N)/2$. If we use
instead a symmetric representation in $[-N/2, N/2)$, the probability that we
need to add or subtract $N$ drops to $1/4 + O(1/N^2)$ at the cost of an additional
test. This extra test might be expensive for small $N$ (say one or two words)
but should be relatively cheap if $N$ is large enough, say at least ten words.
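
A direct Python transcription of Algorithm ModularAdd is given below, together with an exhaustive count for a small modulus that illustrates the probability $(1 - 1/N)/2$ quoted above. The function name and the choice N = 10 are assumptions made for the example.

```python
def modular_add(a, b, N):
    """Return (a + b) mod N for residues 0 <= a, b < N, with one conditional subtraction."""
    c = a + b
    if c >= N:
        c -= N
    return c

# Count how often the subtraction is needed over all pairs (a, b), for a small N.
N = 10
subtractions = sum(1 for a in range(N) for b in range(N) if a + b >= N)
probability = subtractions / N**2       # equals (1 - 1/N)/2 for every N >= 1
```

For N = 10 the subtraction occurs for 45 of the 100 pairs, i.e. with probability 0.45.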


2.3 The Fourier transform 


In this section, we introduce the discrete Fourier transform (DFT). An important
application of the DFT is in computing convolutions via the Convolution
Theorem. In general, the convolution of two vectors can be computed using
three DFTs (for details see §2.9). Here we show how to compute the DFT
efficiently (via the fast Fourier transform or FFT), and show how it can be used
to multiply two $n$-bit integers in time $O(n \log n \log\log n)$ (the Schönhage–
Strassen algorithm, see §2.3.3).


2.3.1 Theoretical setting 


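Let $R$ be a ring, $K \ge 2$ an integer, and $\omega$ a principal $K$th root of unity in
$R$, i.e. such that $\omega^K = 1$ and $\sum_{j=0}^{K-1} \omega^{ij} = 0$ for $1 \le i < K$. The Fourier
transform (or forward (Fourier) transform) of a vector $a = [a_0, a_1, \ldots, a_{K-1}]$
of $K$ elements from $R$ is the vector $\hat{a} = [\hat{a}_0, \hat{a}_1, \ldots, \hat{a}_{K-1}]$ such that

$$\hat{a}_i = \sum_{j=0}^{K-1} \omega^{ij} a_j. \qquad (2.1)$$


If we transform the vector $a$ twice, we get back to the initial vector, apart
from a multiplicative factor $K$ and a permutation of the elements of the vector.
Indeed, for $0 \le i < K$,

$$\hat{\hat{a}}_i = \sum_{j=0}^{K-1} \omega^{ij} \hat{a}_j
  = \sum_{j=0}^{K-1} \omega^{ij} \sum_{\ell=0}^{K-1} \omega^{j\ell} a_\ell
  = \sum_{\ell=0}^{K-1} a_\ell \Bigl( \sum_{j=0}^{K-1} \omega^{(i+\ell)j} \Bigr).$$

Let $\tau = \omega^{i+\ell}$. If $i + \ell \ne 0 \bmod K$, i.e. if $i + \ell$ is not $0$ or $K$, the sum
$\sum_{j=0}^{K-1} \tau^j$ vanishes since $\omega$ is principal. For $i + \ell \in \{0, K\}$, we have $\tau = 1$
and the sum equals $K$. It follows that

$$\hat{\hat{a}}_i = K \sum_{\substack{\ell=0 \\ i+\ell \in \{0, K\}}}^{K-1} a_\ell = K a_{(-i) \bmod K}.$$

Thus, we have $\hat{\hat{a}} = K [a_0, a_{K-1}, a_{K-2}, \ldots, a_2, a_1]$.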


If we transform the vector $a$ twice, but use $\omega^{-1}$ instead of $\omega$ for the second
transform (which is then called a backward transform), we get

$$\tilde{\hat{a}}_i = \sum_{j=0}^{K-1} \omega^{-ij} \hat{a}_j
  = \sum_{j=0}^{K-1} \omega^{-ij} \sum_{\ell=0}^{K-1} \omega^{j\ell} a_\ell
  = \sum_{\ell=0}^{K-1} a_\ell \Bigl( \sum_{j=0}^{K-1} \omega^{(\ell-i)j} \Bigr).$$

The sum $\sum_{j=0}^{K-1} \omega^{(\ell-i)j}$ vanishes unless $\ell = i$, in which case it equals $K$.
Thus, we have $\tilde{\hat{a}}_i = K a_i$. Apart from the multiplicative factor $K$, the backward
transform is the inverse of the forward transform, as might be expected from
the names.
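
The two identities of this section can be checked numerically with a naive evaluation of Eqn. (2.1). The sketch below is an illustration under assumed parameters: it works in the ring of integers modulo 17 with $\omega = 2$, which has order $K = 8$ there; the function name `dft` is hypothetical.

```python
def dft(a, w, p):
    """Naive O(K^2) evaluation of Eqn. (2.1) in Z/pZ with root w."""
    K = len(a)
    return [sum(pow(w, i * j, p) * a[j] for j in range(K)) % p for i in range(K)]

p, K, w = 17, 8, 2                      # 2^8 = 1 mod 17 (assumed example ring)
a = [3, 1, 4, 1, 5, 9, 2, 6]

# w is principal: the sums of powers of w^i vanish for 1 <= i < K.
principal = all(sum(pow(w, i * j, p) for j in range(K)) % p == 0 for i in range(1, K))

forward = dft(a, w, p)
twice = dft(forward, w, p)              # K * [a_0, a_{K-1}, ..., a_1]
back = dft(forward, pow(w, -1, p), p)   # K * a, using the inverse root (Python 3.8+)
```

Dividing `back` by K modulo p (K is invertible modulo 17) would recover the original vector.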


2.3.2 The fast Fourier transform


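If evaluated naively, Eqn. (2.1) requires $\Omega(K^2)$ operations to compute the
Fourier transform of a vector of $K$ elements. The fast Fourier transform or
FFT is an efficient way to evaluate Eqn. (2.1) using only $O(K \log K)$ operations.
From now on we assume that $K$ is a power of two, since this is the
most common case and simplifies the description of the FFT (see §2.9 for the
general case).

Let us illustrate the FFT for $K = 8$. Since $\omega^8 = 1$, we have reduced the
exponents modulo 8 in the following. We want to compute

$$\begin{aligned}
\hat{a}_0 &= a_0 + a_1 + a_2 + a_3 + a_4 + a_5 + a_6 + a_7, \\
\hat{a}_1 &= a_0 + \omega a_1 + \omega^2 a_2 + \omega^3 a_3 + \omega^4 a_4 + \omega^5 a_5 + \omega^6 a_6 + \omega^7 a_7, \\
\hat{a}_2 &= a_0 + \omega^2 a_1 + \omega^4 a_2 + \omega^6 a_3 + a_4 + \omega^2 a_5 + \omega^4 a_6 + \omega^6 a_7, \\
\hat{a}_3 &= a_0 + \omega^3 a_1 + \omega^6 a_2 + \omega a_3 + \omega^4 a_4 + \omega^7 a_5 + \omega^2 a_6 + \omega^5 a_7, \\
\hat{a}_4 &= a_0 + \omega^4 a_1 + a_2 + \omega^4 a_3 + a_4 + \omega^4 a_5 + a_6 + \omega^4 a_7, \\
\hat{a}_5 &= a_0 + \omega^5 a_1 + \omega^2 a_2 + \omega^7 a_3 + \omega^4 a_4 + \omega a_5 + \omega^6 a_6 + \omega^3 a_7, \\
\hat{a}_6 &= a_0 + \omega^6 a_1 + \omega^4 a_2 + \omega^2 a_3 + a_4 + \omega^6 a_5 + \omega^4 a_6 + \omega^2 a_7, \\
\hat{a}_7 &= a_0 + \omega^7 a_1 + \omega^6 a_2 + \omega^5 a_3 + \omega^4 a_4 + \omega^3 a_5 + \omega^2 a_6 + \omega a_7.
\end{aligned}$$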


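We see that we can share some computations. For example, the sum $a_0 + a_4$
appears in four places: in $\hat{a}_0$, $\hat{a}_2$, $\hat{a}_4$, and $\hat{a}_6$. Let us define $a_{0,4} = a_0 + a_4$,
$a_{1,5} = a_1 + a_5$, $a_{2,6} = a_2 + a_6$, $a_{3,7} = a_3 + a_7$, $a_{4,0} = a_0 + \omega^4 a_4$,
$a_{5,1} = a_1 + \omega^4 a_5$, $a_{6,2} = a_2 + \omega^4 a_6$, $a_{7,3} = a_3 + \omega^4 a_7$. Then we have, using
the fact that $\omega^8 = 1$:

$$\begin{aligned}
\hat{a}_0 &= a_{0,4} + a_{1,5} + a_{2,6} + a_{3,7}, &
\hat{a}_1 &= a_{4,0} + \omega a_{5,1} + \omega^2 a_{6,2} + \omega^3 a_{7,3}, \\
\hat{a}_2 &= a_{0,4} + \omega^2 a_{1,5} + \omega^4 a_{2,6} + \omega^6 a_{3,7}, &
\hat{a}_3 &= a_{4,0} + \omega^3 a_{5,1} + \omega^6 a_{6,2} + \omega a_{7,3}, \\
\hat{a}_4 &= a_{0,4} + \omega^4 a_{1,5} + a_{2,6} + \omega^4 a_{3,7}, &
\hat{a}_5 &= a_{4,0} + \omega^5 a_{5,1} + \omega^2 a_{6,2} + \omega^7 a_{7,3}, \\
\hat{a}_6 &= a_{0,4} + \omega^6 a_{1,5} + \omega^4 a_{2,6} + \omega^2 a_{3,7}, &
\hat{a}_7 &= a_{4,0} + \omega^7 a_{5,1} + \omega^6 a_{6,2} + \omega^5 a_{7,3}.
\end{aligned}$$

Now the sum $a_{0,4} + a_{2,6}$ appears at two different places. Let
$a_{0,4,2,6} = a_{0,4} + a_{2,6}$, $a_{1,5,3,7} = a_{1,5} + a_{3,7}$,
$a_{2,6,0,4} = a_{0,4} + \omega^4 a_{2,6}$, $a_{3,7,1,5} = a_{1,5} + \omega^4 a_{3,7}$,
$a_{4,0,6,2} = a_{4,0} + \omega^2 a_{6,2}$, $a_{5,1,7,3} = a_{5,1} + \omega^2 a_{7,3}$,
$a_{6,2,4,0} = a_{4,0} + \omega^6 a_{6,2}$, $a_{7,3,5,1} = a_{5,1} + \omega^6 a_{7,3}$. Then we have

$$\begin{aligned}
\hat{a}_0 &= a_{0,4,2,6} + a_{1,5,3,7}, &
\hat{a}_1 &= a_{4,0,6,2} + \omega a_{5,1,7,3}, \\
\hat{a}_2 &= a_{2,6,0,4} + \omega^2 a_{3,7,1,5}, &
\hat{a}_3 &= a_{6,2,4,0} + \omega^3 a_{7,3,5,1}, \\
\hat{a}_4 &= a_{0,4,2,6} + \omega^4 a_{1,5,3,7}, &
\hat{a}_5 &= a_{4,0,6,2} + \omega^5 a_{5,1,7,3}, \\
\hat{a}_6 &= a_{2,6,0,4} + \omega^6 a_{3,7,1,5}, &
\hat{a}_7 &= a_{6,2,4,0} + \omega^7 a_{7,3,5,1}.
\end{aligned}$$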


In summary, after a first stage where we have computed eight intermediary
variables $a_{0,4}$ to $a_{7,3}$, and a second stage with eight extra intermediary
variables $a_{0,4,2,6}$ to $a_{7,3,5,1}$, we are able to compute the transformed vector in eight
extra steps. The total number of steps is thus $24 = 8 \lg 8$, where each step has
the form $a \leftarrow b + \omega^j c$.

If we take a closer look, we can group operations in pairs $(a, a')$, which have
the form $a = b + \omega^j c$ and $a' = b + \omega^{j+4} c$. For example, in the first stage we
have $a_{1,5} = a_1 + a_5$ and $a_{5,1} = a_1 + \omega^4 a_5$; in the second stage we have
$a_{4,0,6,2} = a_{4,0} + \omega^2 a_{6,2}$ and $a_{6,2,4,0} = a_{4,0} + \omega^6 a_{6,2}$. Since $\omega^4 = -1$, this can
also be written $(a, a') = (b + \omega^j c, b - \omega^j c)$, where $\omega^j c$ needs to be computed
only once. A pair of two such operations is called a butterfly operation.

The FFT can be performed in place. Indeed, the result of the butterfly
between $a_0$ and $a_4$, i.e. $(a_{0,4}, a_{4,0}) = (a_0 + a_4, a_0 - a_4)$, can overwrite $(a_0, a_4)$,
since the values of $a_0$ and $a_4$ are no longer needed.

Algorithm ForwardFFT is a recursive and in-place implementation of the
forward FFT. It uses an auxiliary function bitrev$(j, K)$, which returns the
bit-reversal of the integer $j$, considered as an integer of $\lg K$ bits. For example,
bitrev$(j, 8)$ gives $0, 4, 2, 6, 1, 5, 3, 7$ for $j = 0, \ldots, 7$.


Algorithm 2.2 ForwardFFT
Input: vector $a = [a_0, a_1, \ldots, a_{K-1}]$, $\omega$ principal $K$th root of unity, $K = 2^k$
Output: in-place transformed vector $\hat{a}$, bit-reversed

1: if $K = 2$ then
2:     $[a_0, a_1] \leftarrow [a_0 + a_1, a_0 - a_1]$
3: else
4:     $[a_0, a_2, \ldots, a_{K-2}] \leftarrow$ ForwardFFT$([a_0, a_2, \ldots, a_{K-2}], \omega^2, K/2)$
5:     $[a_1, a_3, \ldots, a_{K-1}] \leftarrow$ ForwardFFT$([a_1, a_3, \ldots, a_{K-1}], \omega^2, K/2)$
6:     for $j$ from $0$ to $K/2 - 1$ do
7:         $[a_{2j}, a_{2j+1}] \leftarrow [a_{2j} + \omega^{\mathrm{bitrev}(j, K/2)} a_{2j+1},\ a_{2j} - \omega^{\mathrm{bitrev}(j, K/2)} a_{2j+1}]$.


Theorem 2.1 Given an input vector $a = [a_0, a_1, \ldots, a_{K-1}]$, Algorithm
ForwardFFT replaces it by its Fourier transform, in bit-reverse order, in
$O(K \log K)$ operations in the ring $R$.


Proof. We prove the statement by induction on $K = 2^k$. For $K = 2$, the
Fourier transform of $[a_0, a_1]$ is $[a_0 + a_1, a_0 + \omega a_1]$, and the bit-reverse order
coincides with the normal order; since $\omega = -1$, the statement follows. Now
assume the statement is true for $K/2$. Let $0 \le j < K/2$, and write
$j' := \mathrm{bitrev}(j, K/2)$. Let $b = [b_0, \ldots, b_{K/2-1}]$ be the vector obtained at step 4, and
$c = [c_0, \ldots, c_{K/2-1}]$ be the vector obtained at step 5. By induction

$$b_j = \sum_{\ell=0}^{K/2-1} \omega^{2j'\ell} a_{2\ell}, \qquad
  c_j = \sum_{\ell=0}^{K/2-1} \omega^{2j'\ell} a_{2\ell+1}.$$

Since $b_j$ is stored at $a_{2j}$ and $c_j$ at $a_{2j+1}$, we compute at step 7

$$a_{2j} = b_j + \omega^{j'} c_j
  = \sum_{\ell=0}^{K/2-1} \omega^{2j'\ell} a_{2\ell} + \omega^{j'} \sum_{\ell=0}^{K/2-1} \omega^{2j'\ell} a_{2\ell+1}
  = \sum_{\ell=0}^{K-1} \omega^{j'\ell} a_\ell = \hat{a}_{j'}.$$

Similarly, since $-\omega^{j'} = \omega^{K/2+j'}$,

$$a_{2j+1} = \sum_{\ell=0}^{K/2-1} \omega^{2j'\ell} a_{2\ell} + \omega^{K/2+j'} \sum_{\ell=0}^{K/2-1} \omega^{2j'\ell} a_{2\ell+1}
  = \sum_{\ell=0}^{K-1} \omega^{(K/2+j')\ell} a_\ell = \hat{a}_{K/2+j'},$$

where we used the fact that $\omega^{2j'} = \omega^{2(j'+K/2)}$. Since $\mathrm{bitrev}(2j, K) =
\mathrm{bitrev}(j, K/2)$ and $\mathrm{bitrev}(2j + 1, K) = K/2 + \mathrm{bitrev}(j, K/2)$, the first part
of the theorem follows. The complexity bound follows from the fact that the
cost $T(K)$ satisfies the recurrence $T(K) \le 2T(K/2) + O(K)$.
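
The recursion of Algorithm ForwardFFT can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the ring is taken to be the integers modulo a prime p (here p = 17 with $\omega = 2$, of order 8), the function names are hypothetical, and for readability the sketch returns new lists instead of overwriting the input in place. A naive DFT is included only to check the bit-reversed output order.

```python
def bitrev(j, K):
    """Reverse the lg(K) bits of j, for K a power of two."""
    r = 0
    for _ in range(K.bit_length() - 1):
        r, j = (r << 1) | (j & 1), j >> 1
    return r

def forward_fft(a, w, p):
    """DFT of a over Z/pZ with root w, returned in bit-reversed order."""
    K = len(a)
    if K == 2:                                  # steps 1-2 (here w = -1)
        return [(a[0] + a[1]) % p, (a[0] - a[1]) % p]
    w2 = w * w % p
    even = forward_fft(a[0::2], w2, p)          # step 4: even-indexed entries
    odd = forward_fft(a[1::2], w2, p)           # step 5: odd-indexed entries
    out = [0] * K
    for j in range(K // 2):                     # steps 6-7: one butterfly per j
        t = pow(w, bitrev(j, K // 2), p) * odd[j] % p
        out[2 * j] = (even[j] + t) % p
        out[2 * j + 1] = (even[j] - t) % p
    return out

def naive_dft(a, w, p):
    """Quadratic reference transform in normal order, used only for checking."""
    K = len(a)
    return [sum(pow(w, i * j, p) * a[j] for j in range(K)) % p for i in range(K)]

p, w = 17, 2                                    # assumed example ring and root
a = [3, 1, 4, 1, 5, 9, 2, 6]
fft = forward_fft(a, w, p)
ref = naive_dft(a, w, p)
```

Position `bitrev(i, 8)` of `fft` holds the coefficient that `ref` stores at position `i`, in agreement with Theorem 2.1.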


Algorithm 2.3 BackwardFFT
Input: vector $a$ bit-reversed, $\omega$ principal $K$th root of unity, $K = 2^k$
Output: in-place transformed vector $\tilde{a}$, normal order

1: if $K = 2$ then
2:     $[a_0, a_1] \leftarrow [a_0 + a_1, a_0 - a_1]$
3: else
4:     $[a_0, \ldots, a_{K/2-1}] \leftarrow$ BackwardFFT$([a_0, \ldots, a_{K/2-1}], \omega^2, K/2)$
5:     $[a_{K/2}, \ldots, a_{K-1}] \leftarrow$ BackwardFFT$([a_{K/2}, \ldots, a_{K-1}], \omega^2, K/2)$
6:     for $j$ from $0$ to $K/2 - 1$ do        ▷ $\omega^{-j} = \omega^{K-j}$
7:         $[a_j, a_{K/2+j}] \leftarrow [a_j + \omega^{-j} a_{K/2+j},\ a_j - \omega^{-j} a_{K/2+j}]$.


Theorem 2.2 Given an input vector $a = [a_0, a_{K/2}, \ldots, a_{K-1}]$ in bit-reverse
order, Algorithm BackwardFFT replaces it by its backward Fourier transform,
in normal order, in $O(K \log K)$ operations in $R$.


Proof. The complexity bound follows as in the proof of Theorem 2.1. For
the correctness result, we again use induction on $K = 2^k$. For $K = 2$, the
backward Fourier transform $\tilde{a} = [a_0 + a_1, a_0 + \omega^{-1} a_1]$ is exactly what the
algorithm returns, since $\omega = \omega^{-1} = -1$ in that case. Assume now $K \ge 4$,
a power of two. The first half, say $b$, of the vector $a$ corresponds to the
bit-reversed vector of the even indices, since $\mathrm{bitrev}(2j, K) = \mathrm{bitrev}(j, K/2)$.
Similarly, the second half, say $c$, corresponds to the bit-reversed vector of the
odd indices, since $\mathrm{bitrev}(2j + 1, K) = K/2 + \mathrm{bitrev}(j, K/2)$. Thus, we can
apply the theorem by induction to $b$ and $c$. It follows that $b$ is the backward
transform of length $K/2$ with $\omega^2$ for the even indices (in normal order), and
similarly $c$ is the backward transform of length $K/2$ for the odd indices:

$$b_j = \sum_{\ell=0}^{K/2-1} \omega^{-2j\ell} a_{2\ell}, \qquad
  c_j = \sum_{\ell=0}^{K/2-1} \omega^{-2j\ell} a_{2\ell+1}.$$

Since $b_j$ is stored in $a_j$ and $c_j$ in $a_{K/2+j}$, we have

$$a_j = b_j + \omega^{-j} c_j
  = \sum_{\ell=0}^{K/2-1} \omega^{-2j\ell} a_{2\ell} + \omega^{-j} \sum_{\ell=0}^{K/2-1} \omega^{-2j\ell} a_{2\ell+1}
  = \sum_{\ell=0}^{K-1} \omega^{-j\ell} a_\ell = \tilde{a}_j,$$

and similarly, using $-\omega^{-j} = \omega^{-K/2-j}$ and $\omega^{-2j} = \omega^{-2(K/2+j)}$,

$$a_{K/2+j} = \sum_{\ell=0}^{K/2-1} \omega^{-2j\ell} a_{2\ell} + \omega^{-K/2-j} \sum_{\ell=0}^{K/2-1} \omega^{-2j\ell} a_{2\ell+1}
  = \sum_{\ell=0}^{K-1} \omega^{-(K/2+j)\ell} a_\ell = \tilde{a}_{K/2+j}.$$


2.3.3 The Schönhage–Strassen algorithm


We now describe the Schönhage–Strassen $O(n \log n \log\log n)$ algorithm to
multiply two integers of $n$ bits. The heart of the algorithm is a routine to
multiply two integers modulo $2^n + 1$.


Theorem 2.3 Given $0 \le A, B < 2^n + 1$, Algorithm FFTMulMod correctly
returns $A \cdot B \bmod (2^n + 1)$, and it costs $O(n \log n \log\log n)$ bit-operations if
$K = \Theta(\sqrt{n})$.


Proof. The proof is by induction on $n$, because at step 8 we call FFTMulMod
recursively, unless $n'$ is sufficiently small that a simpler algorithm (classical,
Karatsuba or Toom–Cook) can be used. There is no difficulty in starting the
induction.

With $a_j$, $b_j$ the values at steps 1 and 2, we have $A = \sum_{j=0}^{K-1} a_j 2^{jM}$ and
$B = \sum_{j=0}^{K-1} b_j 2^{jM}$; thus, $A \cdot B = \sum_{j=0}^{K-1} c_j 2^{jM} \bmod (2^n + 1)$ with

$$c_j = \sum_{\substack{\ell, m = 0 \\ \ell + m = j}}^{K-1} a_\ell b_m
      - \sum_{\substack{\ell, m = 0 \\ \ell + m = K + j}}^{K-1} a_\ell b_m. \qquad (2.2)$$



Algorithm 2.4 FFTMulMod
Input: $0 \le A, B < 2^n + 1$, an integer $K = 2^k$ such that $n = MK$
Output: $C = A \cdot B \bmod (2^n + 1)$

1: decompose $A = \sum_{j=0}^{K-1} a_j 2^{jM}$ with $0 \le a_j < 2^M$, except that $0 \le a_{K-1} \le 2^M$
2: decompose $B$ similarly
3: choose $n' \ge 2n/K + k$, $n'$ multiple of $K$; let $\theta = 2^{n'/K}$, $\omega = \theta^2$
4: for $j$ from $0$ to $K - 1$ do
5:     $(a_j, b_j) \leftarrow (\theta^j a_j, \theta^j b_j) \bmod (2^{n'} + 1)$
6: $a \leftarrow$ ForwardFFT$(a, \omega, K)$, $b \leftarrow$ ForwardFFT$(b, \omega, K)$
7: for $j$ from $0$ to $K - 1$ do        ▷ call FFTMulMod
8:     $c_j \leftarrow a_j b_j \bmod (2^{n'} + 1)$        ▷ recursively if $n'$ is large
9: $c \leftarrow$ BackwardFFT$(c, \omega, K)$
10: for $j$ from $0$ to $K - 1$ do
11:     $c_j \leftarrow c_j/(K\theta^j) \bmod (2^{n'} + 1)$
12:     if $c_j \ge (j + 1) 2^{2M}$ then
13:         $c_j \leftarrow c_j - (2^{n'} + 1)$
14: $C = \sum_{j=0}^{K-1} c_j 2^{jM}$.


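We have $(j + 1 - K) 2^{2M} \le c_j < (j + 1) 2^{2M}$, since the first sum contains
$j + 1$ terms, the second sum $K - (j + 1)$ terms, and at least one of $a_\ell$ and $b_m$
is less than $2^M$ in the first sum.

Let $a'_j$ be the value of $a_j$ after step 5: $a'_j = \theta^j a_j \bmod (2^{n'} + 1)$, and
similarly for $b'_j$. Using Theorem 2.1, after step 6 we have
$a_{\mathrm{bitrev}(j,K)} = \sum_{\ell=0}^{K-1} \omega^{\ell j} a'_\ell \bmod (2^{n'} + 1)$, and similarly for $b$. Thus at step 8

$$c_{\mathrm{bitrev}(j,K)} = \Bigl( \sum_{\ell=0}^{K-1} \omega^{\ell j} a'_\ell \Bigr) \Bigl( \sum_{m=0}^{K-1} \omega^{m j} b'_m \Bigr).$$

After step 9, using Theorem 2.2

$$c'_i = \sum_{j=0}^{K-1} \omega^{-ij} \Bigl( \sum_{\ell=0}^{K-1} \omega^{\ell j} a'_\ell \Bigr) \Bigl( \sum_{m=0}^{K-1} \omega^{m j} b'_m \Bigr)
      = K \sum_{\substack{\ell, m = 0 \\ \ell + m = i}}^{K-1} a'_\ell b'_m
      + K \sum_{\substack{\ell, m = 0 \\ \ell + m = K + i}}^{K-1} a'_\ell b'_m.$$

The first sum equals $\theta^i \sum_{\ell + m = i} a_\ell b_m$; the second is $\theta^{K+i} \sum_{\ell + m = K + i} a_\ell b_m$.
Since $\theta^K = -1 \bmod (2^{n'} + 1)$, after step 11 we have

$$c_i = \sum_{\substack{\ell, m = 0 \\ \ell + m = i}}^{K-1} a_\ell b_m
      - \sum_{\substack{\ell, m = 0 \\ \ell + m = K + i}}^{K-1} a_\ell b_m \bmod (2^{n'} + 1).$$

The correction at step 13 ensures that $c_i$ lies in the correct interval, as given by
Eqn. (2.2). □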

For the complexity analysis, assume that $K = \Theta(\sqrt{n})$. Thus, we have
$n' = \Theta(\sqrt{n})$. Steps 1 and 2 cost $O(n)$; step 5 also costs $O(n)$ (counting the
cumulated cost for all values of $j$). Step 6 costs $O(K \log K)$ times the cost
of one butterfly operation mod $(2^{n'} + 1)$, which is $O(n')$, thus a total of
$O(K n' \log K) = O(n \log n)$. Step 8, using the same algorithm recursively,
costs $O(n' \log n' \log\log n')$ per value of $j$ by the induction hypothesis,
giving a total of $O(n \log n \log\log n)$. The backward FFT costs $O(n \log n)$
too, and the final steps cost $O(n)$, giving a total cost of $O(n \log n \log\log n)$.
The $\log\log n$ term is the depth of the recursion, each level reducing $n$ to
$n' = \Theta(\sqrt{n})$.


EXAMPLE: to multiply two integers modulo $(2^{1\,048\,576} + 1)$, we can take
$K = 2^{10} = 1024$, and $n' = 3072$. We recursively compute 1024 products modulo
$(2^{3072} + 1)$. Alternatively, we can take the smaller value $K = 512$, with 512
recursive products modulo $(2^{4608} + 1)$.
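
The structure of Algorithm FFTMulMod can be followed on small numbers with the Python sketch below. It is only an illustration of one level of the scheme, under several simplifying assumptions: naive quadratic transforms in normal order stand in for ForwardFFT and BackwardFFT, Python's built-in multiplication is used at step 8 instead of a recursive call, and the function name and test values are hypothetical.

```python
def fft_mul_mod(A, B, n, K):
    """A*B mod (2^n + 1) following the steps of FFTMulMod (one level, naive DFTs)."""
    k = K.bit_length() - 1
    assert K == 1 << k and n % K == 0
    M = n // K
    n1 = -(-(2 * M + k) // K) * K           # step 3: n' >= 2n/K + k, multiple of K
    mod = (1 << n1) + 1
    theta = pow(2, n1 // K, mod)            # theta^K = -1 mod (2^n' + 1)
    omega = theta * theta % mod

    def pieces(X):                          # steps 1-2: K pieces of M bits
        p = [(X >> (j * M)) & ((1 << M) - 1) for j in range(K)]
        p[K - 1] = X >> ((K - 1) * M)       # the top piece may equal 2^M
        return p

    def dft(v, root):                       # naive transform, normal order
        return [sum(pow(root, i * j, mod) * v[j] for j in range(K)) % mod
                for i in range(K)]

    a = [pow(theta, j, mod) * x % mod for j, x in enumerate(pieces(A))]   # step 5
    b = [pow(theta, j, mod) * x % mod for j, x in enumerate(pieces(B))]
    c = [x * y % mod for x, y in zip(dft(a, omega), dft(b, omega))]       # steps 6-8
    c = dft(c, pow(omega, -1, mod))                                       # step 9
    C = 0
    for j in range(K):                                                    # steps 10-14
        cj = c[j] * pow(K * pow(theta, j, mod), -1, mod) % mod            # step 11
        if cj >= (j + 1) << (2 * M):                                      # step 12
            cj -= mod                                                     # step 13
        C += cj << (j * M)
    return C % ((1 << n) + 1)

A, B = 0x123456789ABCDEF0, 0xFEDCBA9876543210
product = fft_mul_mod(A, B, 64, 8)          # n = 64, K = 8, so M = 8 and n' = 24
```

Weighting by powers of $\theta$ before the transforms is what turns the cyclic convolution computed by the DFTs into the signed sum of Eqn. (2.2); the final reduction modulo $2^n + 1$ only normalizes the result.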


REMARK 1: the “small” products at step 8 (mod $(2^{3072} + 1)$ or mod $(2^{4608} + 1)$
in our example) can be performed by the same algorithm applied recursively,
but at some point (determined by details of the implementation) it will be more
efficient to use a simpler algorithm, such as the classical or Karatsuba
algorithm (see §1.3). In practice, the depth of recursion is a small constant,
typically 1 or 2. Thus, for practical purposes, the $\log\log n$ term can be regarded
as a constant. For a theoretical way of avoiding the $\log\log n$ term, see the
comments on Fürer's algorithm in §2.9.


REMARK 2: if we replace $\theta$ by 1 in Algorithm FFTMulMod, i.e. remove
step 5, replace step 11 by $c_j \leftarrow c_j/K \bmod (2^{n'} + 1)$, and replace the condition
at step 12 by $c_j \ge K \cdot 2^{2M}$, then we compute $C = A \cdot B \bmod (2^n - 1)$ instead
of mod $(2^n + 1)$. This is useful in McLaughlin's algorithm (§2.4.3).

Algorithm FFTMulMod enables us to multiply two integers modulo $(2^n + 1)$
in $O(n \log n \log\log n)$ operations, for a suitable $n$ and a corresponding FFT
length $K = 2^k$. Since we should have $K \approx \sqrt{n}$ and $K$ must divide $n$, suitable
values of $n$ are the integers with the low-order half of their bits zero; there is
no shortage of such integers. To multiply two integers of at most $n$ bits, we
first choose a suitable bit size $m \ge 2n$. We consider the integers as residues
modulo $(2^m + 1)$, then Algorithm FFTMulMod gives their integer product.
The resulting complexity is $O(n \log n \log\log n)$, since $m = O(n)$. In practice,
the $\log\log n$ term can be regarded as a constant; theoretically, it can be replaced
by an extremely slowly growing function (see Remark 1 above).

In this book, we sometimes implicitly assume that n-bit integer multiplica- 
tion costs the same as three FFTs of length 2n, since this is true if an FFT-based 
algorithm is used for multiplication. The constant “three” can be reduced if 
some of the FFTs can be precomputed and reused many times, for example if 
some of the operands in the multiplications are fixed. 


2.4 Modular multiplication 


Modular multiplication means computing $A \cdot B \bmod N$, where $A$ and $B$ are
residues modulo $N$. Of course, once the product $C = A \cdot B$ has been computed,
it suffices to perform a modular reduction $C \bmod N$, which itself reduces to
an integer division. The reader may ask why we did not cover this topic in
§1.4. There are two reasons. First, the algorithms presented below benefit from
some precomputations involving $N$, and are thus specific to the case where
several reductions are performed with the same modulus. Second, some
algorithms avoid performing the full product $C = A \cdot B$; one such example is
McLaughlin's algorithm (§2.4.3).

Algorithms with precomputations include Barrett's algorithm (§2.4.1), which
computes an approximation to the inverse of the modulus, thus trading division
for multiplication; Montgomery's algorithm, which corresponds to Hensel's
division with remainder only (§1.4.8), and its subquadratic variant, which is
the LSB-variant of Barrett's algorithm; and finally McLaughlin's algorithm
(§2.4.3). The cost of the precomputations is not taken into account; it is
assumed to be negligible if many modular reductions are performed. However,
we assume that the amount of precomputed data uses only linear, i.e.
$O(\log N)$, space.

As usual, we assume that the modulus $N$ has $n$ words in base $\beta$, that $A$ and
$B$ have at most $n$ words, and in some cases that they are fully reduced, i.e.
$0 \le A, B < N$.


2.4.1 Barrett’s algorithm 


Barrett's algorithm is attractive when many divisions have to be made with
the same divisor; this is the case when we perform computations modulo a
fixed integer. The idea is to precompute an approximation to the inverse of
the divisor. Thus, an approximation to the quotient is obtained with just one
multiplication, and the corresponding remainder after a second multiplication.
A small number of corrections suffice to convert the approximations into exact
values. For the sake of simplicity, we describe Barrett's algorithm in base $\beta$,
where $\beta$ might be replaced by any integer, in particular $2^n$ or $\beta^n$.


Algorithm 2.5 BarrettDivRem
Input: integers $A$, $B$ with $0 \le A < \beta^2$, $\beta/2 < B < \beta$
Output: quotient $Q$ and remainder $R$ of $A$ divided by $B$

1: $I \leftarrow \lfloor \beta^2/B \rfloor$        ▷ precomputation
2: $Q \leftarrow \lfloor A_1 I/\beta \rfloor$ where $A = A_1 \beta + A_0$ with $0 \le A_0 < \beta$
3: $R \leftarrow A - QB$
4: while $R \ge B$ do
5:     $(Q, R) \leftarrow (Q + 1, R - B)$
6: return $(Q, R)$.


Theorem 2.4 Algorithm BarrettDivRem is correct and step 5 is performed 
at most three times. 


Proof. Since $A = QB + R$ is invariant in the algorithm, we just need to prove 
that $0 \le R < B$ at the end. We first consider the value of $Q$, $R$ before the 
while-loop. Since $\beta/2 < B < \beta$, we have $\beta < \beta^2/B < 2\beta$; thus, 
$\beta \le I < 2\beta$. We have $Q \le A_1 I/\beta \le A_1\beta/B \le A/B$. This ensures 
that $R$ is non-negative. Now $I > \beta^2/B - 1$, which gives 

$$IB > \beta^2 - B.$$

Similarly, $Q > A_1 I/\beta - 1$ gives 

$$\beta Q > A_1 I - \beta.$$

This yields $\beta QB > A_1 IB - \beta B > A_1(\beta^2 - B) - \beta B = \beta(A - A_0) - 
B(\beta + A_1) > \beta A - 4\beta B$ since $A_0 < \beta < 2B$ and $A_1 < \beta$. We conclude that 
$A < B(Q + 4)$; thus, at most three corrections are needed. 


The bound of three corrections is tight: it is attained for $A = 1980$, $B = 36$, 
$\beta = 64$. In this example, $I = 113$, $A_1 = 30$, $Q = 52$, $R = 108 = 3B$. 
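
As an illustration, Algorithm BarrettDivRem can be sketched in Python, whose built-in integers stand in for multiple-precision numbers. The function name `barrett_div_rem` and the returned correction counter are assumptions of this sketch, not part of the algorithm as stated; the counter is there only to make the bound of Theorem 2.4 visible.

```python
def barrett_div_rem(A, B, beta):
    """Sketch of BarrettDivRem for 0 <= A < beta**2 and beta/2 < B < beta."""
    assert 0 <= A < beta * beta and beta < 2 * B and B < beta
    I = beta * beta // B       # step 1: precomputed once per divisor B
    A1 = A // beta             # high word of A = A1*beta + A0
    Q = A1 * I // beta         # step 2: approximate quotient
    R = A - Q * B              # step 3: corresponding remainder
    corrections = 0            # illustrative counter (not in the algorithm)
    while R >= B:              # steps 4-5: correction loop
        Q, R = Q + 1, R - B
        corrections += 1
    return Q, R, corrections


# The tight example from the text: three corrections are needed.
print(barrett_div_rem(1980, 36, 64))   # (55, 0, 3)
```

In a real implementation $I$ would be computed once and reused for every division by the same $B$; here it is recomputed on each call for brevity.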

The multiplications at steps 2 and 3 may be replaced by short products, more 
precisely the multiplication at step 2 by a high short product, and that at step 3 
by a low short product (see §3.3). 

Barrett’s algorithm can also be used for an unbalanced division, when dividing 
$(k + 1)n$ words by $n$ words for $k \ge 2$, which amounts to $k$ divisions of 
$2n$ words by the same $n$-word divisor. In this case, we say that the divisor is 
implicitly invariant. 


Complexity of Barrett’s algorithm 
If the multiplications at steps 2 and 3 are performed using full products, 
Barrett’s algorithm costs $2M(n)$ for a divisor of size $n$. In the FFT range, 
this cost might be lowered to $1.5M(n)$ using the “wrap-around trick” (§3.4.1); 
moreover, if the forward transforms of $I$ and $B$ are stored, the cost decreases 
to $M(n)$, assuming $M(n)$ is the cost of three FFTs. 


2.4.2 Montgomery’s multiplication 


Montgomery’s algorithm is very efficient for modular arithmetic modulo a 
fixed modulus $N$. The main idea is to replace a residue $A \bmod N$ by $A' = 
\lambda A \bmod N$, where $A'$ is the “Montgomery form” corresponding to the residue 
$A$, with $\lambda$ an integer constant such that $\gcd(N, \lambda) = 1$. Addition and 
subtraction are unchanged, since $\lambda A + \lambda B \equiv \lambda(A + B) \bmod N$. The 
multiplication of two residues in Montgomery form does not give exactly what we want: 
$(\lambda A)(\lambda B) \not\equiv \lambda(AB) \bmod N$. The trick is to replace the classical 
modular multiplication by “Montgomery’s multiplication” 

$$\mathrm{MontgomeryMul}(A', B') = \frac{A'B'}{\lambda} \bmod N.$$

For some values of $\lambda$, MontgomeryMul$(A', B')$ can easily be computed, in 
particular for $\lambda = \beta^n$, where $N$ uses $n$ words in base $\beta$. Algorithm 2.6 is 
a quadratic algorithm (REDC) to compute MontgomeryMul$(A', B')$ in this 
case, and a subquadratic reduction (FastREDC) is given in Algorithm 2.7. 

Another view of Montgomery’s algorithm for $\lambda = \beta^n$ is to consider that it 
computes the remainder of Hensel’s division (§1.4.8). 


Algorithm 2.6 REDC (quadratic non-interleaved version). The $c_i$ form the 
current base-$\beta$ decomposition of $C$, i.e. they are defined by $C = \sum_{i=0}^{2n-1} c_i\beta^i$. 
Input: $0 \le C < \beta^{2n}$, $N < \beta^n$, $\mu \leftarrow -N^{-1} \bmod \beta$, $(\beta, N) = 1$ 
Output: $0 \le R < \beta^n$ such that $R = C\beta^{-n} \bmod N$ 
1: for $i$ from 0 to $n - 1$ do 
2:     $q_i \leftarrow \mu c_i \bmod \beta$    ▷ quotient selection 
3:     $C \leftarrow C + q_i N\beta^i$ 
4: $R \leftarrow C\beta^{-n}$    ▷ trivial exact division 
5: if $R \ge \beta^n$ then return $R - N$ else return $R$. 


Theorem 2.5 Algorithm REDC is correct. 


Proof. We first prove that $R \equiv C\beta^{-n} \bmod N$: $C$ is only modified in step 3, 
which does not change $C \bmod N$; thus, at step 4 we have $R \equiv C\beta^{-n} \bmod N$, 
and this remains true in the last step. 

Assume that, for a given $i$, we have $C \equiv 0 \bmod \beta^i$ when entering step 2. 
Since $q_i \equiv -c_i/N \bmod \beta$, we have $C + q_i N\beta^i \equiv 0 \bmod \beta^{i+1}$ at the next 
step, so the next value of $c_i$ is 0. Thus, on exiting the for-loop, $C$ is a multiple 
of $\beta^n$, and $R$ is an integer at step 4. 

Still at step 4, we have $C < \beta^{2n} + (\beta - 1)N(1 + \beta + \cdots + \beta^{n-1}) = 
\beta^{2n} + N(\beta^n - 1)$; thus, $R < \beta^n + N$ and $R - N < \beta^n$. 


Compared to classical division (Algorithm 1.6 BasecaseDivRem), Montgomery’s 
algorithm has two significant advantages: the quotient selection is 
performed by a multiplication modulo the word base $\beta$, which is more efficient 
than a division by the most significant word $b_{n-1}$ of the divisor as in 
BasecaseDivRem; and there is no repair step inside the for-loop: the repair 
step is at the very end. 

For example, with inputs $C = 766\,970\,544\,842\,443\,844$, $N = 862\,664\,913$, 
and $\beta = 1000$, Algorithm REDC precomputes $\mu = 23$; then we have $q_0 = 412$, 
which yields $C \leftarrow C + 412N = 766\,970\,900\,260\,388\,000$; then $q_1 = 924$, 
which yields $C \leftarrow C + 924N\beta = 767\,768\,002\,640\,000\,000$; then $q_2 = 720$, 
which yields $C \leftarrow C + 720N\beta^2 = 1\,388\,886\,740\,000\,000\,000$. At step 4, 
$R = 1\,388\,886\,740$, and since $R \ge \beta^3$, REDC returns $R - N = 526\,221\,827$. 
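
The worked example can be replayed with a short Python sketch of Algorithm REDC. The function name `redc` and the explicit word-count parameter `n` are assumptions of this sketch; $\mu$ is recomputed on each call for brevity, whereas in practice it would be precomputed once per modulus.

```python
def redc(C, N, beta, n):
    """Sketch of REDC: returns R = C * beta**(-n) mod N with 0 <= R < beta**n.

    Assumes 0 <= C < beta**(2n), N < beta**n and gcd(N, beta) = 1.
    """
    mu = (-pow(N, -1, beta)) % beta      # mu = -1/N mod beta (Python 3.8+)
    for i in range(n):
        c_i = (C // beta**i) % beta      # current word of weight beta**i
        q_i = mu * c_i % beta            # step 2: quotient selection
        C += q_i * N * beta**i           # step 3: clears the word of weight beta**i
    R = C // beta**n                     # step 4: exact division
    return R - N if R >= beta**n else R  # step 5: single final repair


C, N = 766970544842443844, 862664913
print(redc(C, N, 1000, 3))               # 526221827
```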

Since Montgomery’s algorithm (i.e. Hensel’s division with remainder only) 
can be viewed as an LSB variant of classical division, Svoboda’s divisor 
preconditioning (§1.4.2) also translates to the LSB context. More precisely, in 
Algorithm REDC, we want to modify the divisor $N$ so that the quotient selection 
$q_i \leftarrow \mu c_i \bmod \beta$ at step 2 becomes trivial. The multiplier $k$ used in Svoboda 
division is simply the parameter $\mu$ in REDC. A natural choice is $\mu = 1$, which 
corresponds to $N \equiv -1 \bmod \beta$. This motivates the Montgomery–Svoboda 
algorithm, which is as follows: 


1. first compute $N' = \mu N$, with $N' < \beta^{n+1}$, where $\mu = -1/N \bmod \beta$; 

2. perform the $n - 1$ first loops of REDC, replacing $\mu$ by 1, and $N$ by $N'$; 

3. perform a final classical loop with $\mu$ and $N$, and the last steps (4–5) from 
REDC. 


Quotient selection in the Montgomery–Svoboda algorithm simply involves 
“reading” the word of weight $\beta^i$ in the divisor $C$. 

For the example above, we get $N' = 19\,841\,292\,999$; $q_0$ is the least significant 
word of $C$, i.e. $q_0 = 844$, so $C \leftarrow C + 844N' = 766\,987\,290\,893\,735\,000$; 
then $q_1 = 735$ and $C \leftarrow C + 735N'\beta = 781\,570\,641\,248\,000\,000$. The last 
step gives $q_2 = 704$ and $C \leftarrow C + 704N\beta^2 = 1\,388\,886\,740\,000\,000\,000$, 
which is what we found previously. 


Subquadratic Montgomery reduction 
A subquadratic version FastREDC of Algorithm REDC is obtained by taking 
$n = 1$, and considering $\beta$ as a “giant base” (alternatively, replace $\beta$ by $\beta^n$ 
below): 


Algorithm 2.7 FastREDC (subquadratic Montgomery reduction) 
Input: $0 \le C < \beta^2$, $N < \beta$, $\mu \leftarrow -1/N \bmod \beta$ 
Output: $0 \le R < \beta$ such that $R = C/\beta \bmod N$ 
1: $Q \leftarrow \mu C \bmod \beta$ 
2: $R \leftarrow (C + QN)/\beta$ 
3: if $R \ge \beta$ then return $R - N$ else return $R$. 


This is exactly the 2-adic counterpart of Barrett’s subquadratic algorithm; 
steps 1–2 might be performed by a low short product and a high short product, 
respectively. 
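
A minimal Python sketch of FastREDC follows, using full products where a tuned implementation would use short products. The name `fast_redc` is an assumption of this sketch. Taking the giant base $\beta = 1000^3$ on the earlier REDC example reproduces the same result, as expected since the quotient $Q$ is uniquely determined modulo $\beta$.

```python
def fast_redc(C, N, beta):
    """Sketch of FastREDC: returns R = C / beta mod N with 0 <= R < beta.

    Assumes 0 <= C < beta**2, N < beta and gcd(N, beta) = 1.
    """
    mu = (-pow(N, -1, beta)) % beta   # precomputed in practice (Python 3.8+)
    Q = mu * C % beta                 # step 1: only the low half is needed
    R = (C + Q * N) // beta           # step 2: exact division, high half
    return R - N if R >= beta else R  # step 3


print(fast_redc(766970544842443844, 862664913, 10**9))   # 526221827
```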

When combined with Karatsuba’s multiplication, assuming the products 
of steps 1–2 are full products, the reduction requires two multiplications of 
size $n$, i.e. six multiplications of size $n/2$ ($n$ denotes the size of $N$, $\beta$ being a 
giant base). With some additional precomputation, the reduction might be 
performed with five multiplications of size $n/2$, assuming $n$ is even. This is 
simply the Montgomery–Svoboda algorithm with $N$ having two big words in 
base $\beta^{n/2}$. The cost of the algorithm is $M(n, n/2)$ to compute $q_0 N'$ (even if 
$N'$ has in principle $3n/2$ words, we know $N' = H\beta^{n/2} - 1$ with $H < \beta^n$, and 
thus it suffices to multiply $q_0$ by $H$), $M(n/2)$ to compute $\mu C \bmod \beta^{n/2}$, and 
again $M(n, n/2)$ to compute $q_1 N$; thus, a total of $5M(n/2)$ if each $n \times (n/2)$ 
product is realized by two $(n/2) \times (n/2)$ products. 


Algorithm 2.8 MontgomerySvoboda 
Input: $0 \le C < \beta^{2n}$, $N < \beta^n$, $\mu \leftarrow -1/N \bmod \beta^{n/2}$, $N' = \mu N$ 
Output: $0 \le R < \beta^n$ such that $R = C/\beta^n \bmod N$ 
1: $q_0 \leftarrow C \bmod \beta^{n/2}$ 
2: $C \leftarrow (C + q_0 N')/\beta^{n/2}$ 
3: $q_1 \leftarrow \mu C \bmod \beta^{n/2}$ 
4: $R \leftarrow (C + q_1 N)/\beta^{n/2}$ 
5: if $R \ge \beta^n$ then return $R - N$ else return $R$. 

The algorithm is quite similar to the one described at the end of §1.4.6, where 
the cost was $3M(n/2) + D(n/2)$ for a division of $2n$ by $n$ with remainder only. 
The main difference here is that, thanks to Montgomery’s form, the last classical 
division $D(n/2)$ in Svoboda’s algorithm is replaced by multiplications of 
total cost $2M(n/2)$, which is usually faster. 

Algorithm MontgomerySvoboda can be extended as follows. The value $C$ 
obtained after step 2 has $3n/2$ words, i.e. an excess of $n/2$ words. Instead of 
reducing that excess with REDC, we could reduce it using Svoboda’s technique 
with $\mu' = -1/N \bmod \beta^{n/4}$, and $N'' = \mu' N$. This would reduce the 
low $n/4$ words from $C$ at the cost of $M(n, n/4)$, and a last REDC step would 
reduce the final excess of $n/4$, which would give $D(2n, n) = M(n, n/2) + 
M(n, n/4) + M(n/4) + M(n, n/4)$. This “folding” process can be generalized 
to $D(2n, n) = M(n, n/2) + \cdots + M(n, n/2^k) + M(n/2^k) + M(n, n/2^k)$. 
If $M(n, n/2^k)$ reduces to $2^k M(n/2^k)$, this gives 

$$D(n) = 2M(n/2) + 4M(n/4) + \cdots + 2^{k-1}M(n/2^{k-1}) + (2^{k+1} + 1)M(n/2^k).$$

Unfortunately, the resulting multiplications become more and more unbalanced, 
and we need to store $k$ precomputed multiples $N', N'', \ldots$ of $N$, each 
requiring at least $n$ words. Table 2.2 shows that the single-folding algorithm is 
the best one. 


Algorithm    Karatsuba    Toom–Cook 3-way    Toom–Cook 4-way 

D(n)         2.00M(n)     2.63M(n)           3.10M(n) 
1-folding    1.67M(n)     1.81M(n)           1.89M(n) 
2-folding    1.67M(n)     1.91M(n)           2.04M(n) 
3-folding    1.74M(n)     2.06M(n)           2.25M(n) 


Table 2.2 Theoretical complexity of subquadratic REDC with 1-, 2- and 
3-folding, for different multiplication algorithms. 


Exercise 2.6 discusses further possible improvements in the Montgomery— 
Svoboda algorithm, achieving D(n) ~ 1.58M(n) in the case of Karatsuba 
multiplication. 


2.4.3 McLaughlin’s algorithm 


McLaughlin’s algorithm assumes we can perform fast multiplication modulo 
both $2^n - 1$ and $2^n + 1$, for sufficiently many values of $n$. This assumption is 
true for example with the Schönhage–Strassen algorithm: the original version 
multiplies two numbers modulo $2^n + 1$, but discarding the “twist” operations 
before and after the Fourier transforms computes their product modulo $2^n - 1$. 
(This has to be done at the top level only: the recursive operations compute 
modulo $2^{n'} + 1$ in both cases. See Remark 2 on page 57.) 

The key idea in McLaughlin’s algorithm is to avoid the classical “multiply 
and divide” method for modular multiplication. Instead, assuming that $N$ is 
relatively prime to $2^n - 1$, it determines $AB/(2^n - 1) \bmod N$ with convolutions 
modulo $2^n \pm 1$, which can be performed in an efficient way using the 
FFT. 


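Algorithm 2.9 MultMcLaughlin 
Input: $A$, $B$ with $0 \le A, B < N < 2^n$, $\mu = -N^{-1} \bmod (2^n - 1)$ 
Output: $AB/(2^n - 1) \bmod N$ 
1: $m \leftarrow AB\mu \bmod (2^n - 1)$ 
2: $S \leftarrow (AB + mN) \bmod (2^n + 1)$ 
3: $w \leftarrow -S \bmod (2^n + 1)$ 
4: if $2 \mid w$ then $s \leftarrow w/2$ else $s \leftarrow (w + 2^n + 1)/2$ 
5: if $AB + mN \equiv s \bmod 2$ then $t \leftarrow s$ else $t \leftarrow s + 2^n + 1$ 
6: if $t < N$ then return $t$ else return $t - N$. 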


Theorem 2.6 Algorithm MultMcLaughlin computes $AB/(2^n - 1) \bmod N$ 
correctly, in $\sim 1.5M(n)$ operations, assuming multiplication modulo $2^n \pm 1$ 
costs $\sim M(n/2)$, or the same as 3 Fourier transforms of size $n$. 


Proof. Step 1 is similar to step 1 of Algorithm FastREDC, with $\beta$ replaced by 
$2^n - 1$. It follows that $AB + mN \equiv 0 \bmod (2^n - 1)$, therefore we have $AB + 
mN = k(2^n - 1)$ with $0 \le k < 2N$. Step 2 computes $S = -2k \bmod (2^n + 1)$, 
then step 3 gives $w = 2k \bmod (2^n + 1)$, and $s = k \bmod (2^n + 1)$ in step 4. 
Now, since $0 \le k < 2^{n+1}$, the value $s$ does not uniquely determine $k$, whose 
missing bit is determined from the least significant bit from $AB + mN$ (step 5). 
Finally, the last step reduces $t = k$ modulo $N$. 

The cost of the algorithm is mainly that of the four multiplications $AB \bmod 
(2^n \pm 1)$, $(AB)\mu \bmod (2^n - 1)$ and $mN \bmod (2^n + 1)$, which cost $4M(n/2)$ 
altogether. However, in $(AB)\mu \bmod (2^n - 1)$ and $mN \bmod (2^n + 1)$, the 
operands $\mu$ and $N$ are invariant, therefore their Fourier transforms can be 
precomputed, which saves $2M(n/2)/3$ altogether. A further saving of $M(n/2)/3$ 
is obtained since we perform only one backward Fourier transform in step 2. 
Accounting for the savings gives $(4 - 2/3 - 1/3)M(n/2) = 3M(n/2) \sim 
1.5M(n)$. 

The $\sim 1.5M(n)$ cost of McLaughlin’s algorithm is quite surprising, since it 
means that a modular multiplication can be performed faster than two multiplications. 
In other words, since a modular multiplication is basically a multiplication 
followed by a division, this means that (at least in this case) the 
“division” can be performed for half the cost of a multiplication! 
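
The arithmetic of Algorithm MultMcLaughlin (though not its FFT-based cost) can be checked with a small Python sketch. The name `mult_mclaughlin` is an assumption of this sketch, and ordinary integer products replace the convolutions modulo $2^n \pm 1$, so it only illustrates correctness of the six steps.

```python
def mult_mclaughlin(A, B, N, n):
    """Sketch of MultMcLaughlin: returns A*B/(2**n - 1) mod N.

    Assumes 0 <= A, B < N < 2**n and gcd(N, 2**n - 1) = 1.
    """
    minus, plus = 2**n - 1, 2**n + 1
    mu = (-pow(N, -1, minus)) % minus   # precomputed in practice
    m = A * B * mu % minus              # step 1
    S = (A * B + m * N) % plus          # step 2: S = -2k mod (2**n + 1)
    w = (-S) % plus                     # step 3: w = 2k mod (2**n + 1)
    # step 4: halve modulo the odd number 2**n + 1, giving s = k mod (2**n + 1)
    s = w // 2 if w % 2 == 0 else (w + plus) // 2
    # step 5: k has the parity of AB + mN, which fixes the missing bit
    t = s if (A * B + m * N - s) % 2 == 0 else s + plus
    return t if t < N else t - N        # step 6


# Cross-check against a direct computation for one small case.
A, B, N, n = 57, 89, 101, 8
print(mult_mclaughlin(A, B, N, n) == A * B * pow(2**n - 1, -1, N) % N)   # True
```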


2.4.4 Special moduli 


For special moduli $N$ faster algorithms may exist. The ideal case is $N = 
\beta^n \pm 1$. This is precisely the kind of modulus used in the Schönhage–Strassen 
algorithm based on the fast Fourier transform (FFT). In the FFT range, a 
multiplication modulo $\beta^n \pm 1$ is used to perform the product of two integers of 
at most $n/2$ words, and a multiplication modulo $\beta^n \pm 1$ costs $\sim M(n/2) \sim 
M(n)/2$. 

For example, in elliptic curve cryptography (ECC), we almost always use a 
special modulus, for example a pseudo-Mersenne prime like $2^{192} - 2^{64} - 1$ 
or $2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$. However, in most applications the modulus 
cannot be chosen, and there is no reason for it to have a special form. 

We refer to §2.9 for further information about special moduli. 


2.5 Modular division and inversion 


We have seen above that modular multiplication reduces to integer division, 
since to compute $ab \bmod N$, the classical method consists of dividing $ab$ by $N$ 
to obtain $ab = qN + r$, then $ab \equiv r \bmod N$. In the same vein, modular division 
reduces to an (extended) integer gcd. More precisely, the division $a/b \bmod N$ 
is usually computed as $a \cdot (1/b) \bmod N$, thus a modular inverse is followed by 
a modular multiplication. We concentrate on modular inversion in this section. 
We have seen in Chapter 1 that computing an extended gcd is expensive, 
both for small sizes, where it usually costs the same as several multiplications, 
and for large sizes, where it costs $O(M(n) \log n)$. Therefore, modular inversions 
should be avoided if possible; we explain at the end of this section how 
this can be done. 
Algorithm 2.10 (ModularInverse) is just Algorithm ExtendedGcd (§1.6.2), 
with $(a, b) \leftarrow (b, N)$ and the lines computing the cofactors of $N$ omitted. 
Algorithm ModularInverse is the naive version of modular inversion, with 
complexity $O(n^2)$ if $N$ takes $n$ words in base $\beta$. The subquadratic 
$O(M(n) \log n)$ algorithm is based on the HalfBinaryGcd algorithm (§1.6.3). 
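
As a rough illustration, Algorithm ModularInverse (Algorithm 2.10) might be written in Python as follows. The function name `modular_inverse`, the final reduction of $u$ into $[0, N)$ and the coprimality check are additions of this sketch; the algorithm as stated simply returns $u$.

```python
def modular_inverse(b, N):
    """Sketch of ModularInverse: returns u with u*b = 1 mod N, for b prime to N."""
    b0 = b
    u, w, c = 1, 0, N
    # Invariant: u*b0 = b mod N and w*b0 = c mod N.
    while c != 0:
        q, r = divmod(b, c)
        b, c = c, r
        u, w = w, u - q * w
    if b != 1:                       # added check: gcd(b0, N) must be 1
        raise ValueError(f"{b0} is not invertible modulo {N}")
    return u % N                     # added normalization into [0, N)


print(modular_inverse(17, 1001) * 17 % 1001)   # 1
```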
Algorithm 2.10 ModularInverse 
Input: integers $b$ and $N$, $b$ prime to $N$ 
Output: integer $u = 1/b \bmod N$ 
$(u, w) \leftarrow (1, 0)$, $c \leftarrow N$ 
while $c \ne 0$ do 
    $(q, r) \leftarrow \mathrm{DivRem}(b, c)$ 
    $(b, c) \leftarrow (c, r)$ 
    $(u, w) \leftarrow (w, u - qw)$ 
return $u$. 


When the modulus $N$ has a special form, faster algorithms may exist. In 
particular for $N = p^k$, $O(M(n))$ algorithms exist, based on Hensel lifting, 
which can be seen as the $p$-adic variant of Newton’s method (§4.2). To compute 
$1/b \bmod N$, we use a $p$-adic version of the iteration (4.5) 

$$x_{j+1} = x_j + x_j(1 - bx_j) \bmod p^k. \qquad (2.3)$$


Assume $x_j$ approximates $1/b$ to “$p$-adic precision” $\ell$, i.e. $bx_j \equiv 1 + \varepsilon p^\ell$, and 
$k = 2\ell$. Then, modulo $p^k$: $bx_{j+1} = bx_j(2 - bx_j) \equiv (1 + \varepsilon p^\ell)(1 - \varepsilon p^\ell) = 
1 - \varepsilon^2 p^{2\ell}$. Therefore, $x_{j+1}$ approximates $1/b$ to double precision (in the $p$-adic 
sense). 

As an example, assume we want to compute the inverse of an odd integer $b$ 
modulo $2^{32}$. The initial approximation $x_0 = 1$ satisfies $x_0 = 1/b \bmod 2$, thus 
five iterations are enough. The first iteration is $x_1 \leftarrow x_0 + x_0(1 - bx_0) \bmod 2^2$, 
which simplifies to $x_1 \leftarrow 2 - b \bmod 4$ since $x_0 = 1$. Now, whether $b \equiv 1 
\bmod 4$ or $b \equiv 3 \bmod 4$, we have $2 - b \equiv b \bmod 4$; we can therefore start the 
second iteration with $x_1 = b$ implicit: 


$$x_2 \leftarrow b(2 - b^2) \bmod 2^4, \qquad x_3 \leftarrow x_2(2 - bx_2) \bmod 2^8,$$

$$x_4 \leftarrow x_3(2 - bx_3) \bmod 2^{16}, \qquad x_5 \leftarrow x_4(2 - bx_4) \bmod 2^{32}.$$

Consider for example $b = 17$. The above algorithm yields $x_2 = 1$, $x_3 = 241$, 
$x_4 = 61\,681$ and $x_5 = 4\,042\,322\,161$. Of course, any computation mod $p^\ell$ 
might be computed modulo $p^k$ for $k \ge \ell$. In particular, all the above 
computations might be performed modulo $2^{32}$. On a 32-bit computer, arithmetic on 
basic integer types is usually performed modulo $2^{32}$, thus the reduction comes 
for free, and we can write in the C language (using unsigned variables and 
the same variable x for $x_2, \ldots, x_5$) 


x = b*(2-b*b); x *= 2-b*x; x *= 2-b*x; x *= 2-b*x; 
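
The same computation can be mimicked in Python, where integers do not wrap around, by masking with $2^{32} - 1$ after each step. This is only a sketch: the names `MASK` and `inverse_mod_2_32` are assumptions, and the mask plays the role of the implicit reduction performed by 32-bit unsigned arithmetic in C.

```python
MASK = 0xFFFFFFFF  # keeps results modulo 2**32, like a 32-bit unsigned type


def inverse_mod_2_32(b):
    """Inverse of an odd integer b modulo 2**32 by the 2-adic Newton iteration."""
    assert b % 2 == 1
    x = (b * (2 - b * b)) & MASK       # x_2, with x_1 = b implicit
    for _ in range(3):                 # x_3, x_4, x_5: precision doubles each time
        x = (x * (2 - b * x)) & MASK
    return x


print(inverse_mod_2_32(17))            # 4042322161
```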


Another way to perform modular division when the modulus has a special 
form is Hensel’s division (§1.4.8). For a modulus $N = \beta^n$, given two integers 
$A$, $B$, we compute $Q$ and $R$ such that 

$$A = QB + R\beta^n.$$

Therefore, we have $A/B \equiv Q \bmod \beta^n$. While Montgomery’s modular 
multiplication only computes the remainder $R$ of Hensel’s division, modular 
division computes the quotient $Q$; thus, Hensel’s division plays a central role in 
modular arithmetic modulo $\beta^n$. 


2.5.1 Several inversions at once 


A modular inversion, which reduces to an extended gcd (§1.6.2), is usually 
much more expensive than a multiplication. This is true not only in the FFT 
range, where a gcd takes time O(M(n) log n), but also for smaller numbers. 
When several inversions are to be performed modulo the same number, Algo- 
rithm MultipleInversion is usually faster. 


Algorithm 2.11 MultipleInversion 
Input: $0 < x_1, \ldots, x_k < N$ 
Output: $y_1 = 1/x_1 \bmod N, \ldots, y_k = 1/x_k \bmod N$ 
1: $z_1 \leftarrow x_1$ 
2: for $i$ from 2 to $k$ do 
3:     $z_i \leftarrow z_{i-1}x_i \bmod N$ 
4: $q \leftarrow 1/z_k \bmod N$ 
5: for $i$ from $k$ downto 2 do 
6:     $y_i \leftarrow qz_{i-1} \bmod N$ 
7:     $q \leftarrow qx_i \bmod N$ 
8: $y_1 \leftarrow q$. 


Theorem 2.7 Algorithm MultipleInversion is correct. 


Proof. We have $z_i \equiv x_1 x_2 \cdots x_i \bmod N$; thus, at the beginning of step 6 for 
a given $i$, $q \equiv (x_1 \cdots x_i)^{-1} \bmod N$, which gives $y_i \equiv 1/x_i \bmod N$. 
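
A short Python sketch of Algorithm MultipleInversion is given below. The name `multiple_inversion` and the use of 0-based lists are choices of this sketch; the single inversion of step 4 is delegated to Python's built-in `pow(z, -1, N)`, which stands in for any modular inversion routine.

```python
def multiple_inversion(xs, N):
    """Sketch of MultipleInversion: returns [1/x mod N for x in xs].

    Assumes every x in xs is invertible modulo N. Indices are 0-based here.
    """
    k = len(xs)
    z = [xs[0] % N]
    for i in range(1, k):
        z.append(z[-1] * xs[i] % N)   # z[i] = xs[0] * ... * xs[i] mod N
    q = pow(z[-1], -1, N)             # the only modular inversion (Python 3.8+)
    ys = [0] * k
    for i in range(k - 1, 0, -1):
        ys[i] = q * z[i - 1] % N      # q = 1/(xs[0]...xs[i]), so this is 1/xs[i]
        q = q * xs[i] % N             # now q = 1/(xs[0]...xs[i-1])
    ys[0] = q
    return ys


xs, N = [2, 3, 5, 7, 11], 1009
print([x * y % N for x, y in zip(xs, multiple_inversion(xs, N))])   # [1, 1, 1, 1, 1]
```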


This algorithm uses only one modular inversion (step 4), and $3(k - 1)$ modular 
multiplications. Thus, it is faster than $k$ inversions when a modular inversion is 
more than three times as expensive as a product. Figure 2.1 shows a recursive 
variant of the algorithm, with the same number of modular multiplications: one 
for each internal node when going up the (product) tree, and two for each 
internal node when going down the (remainder) tree. The recursive variant might 
be performed in parallel in $O(\log k)$ operations using $O(k/\log k)$ processors. 


                    1/(x_1 x_2 x_3 x_4) 
                   /                   \ 
          1/(x_1 x_2)               1/(x_3 x_4) 
          /        \                /        \ 
      1/x_1      1/x_2          1/x_3      1/x_4 


Figure 2.1 A recursive variant of Algorithm MultipleInversion. First go 
up the tree, building $x_1 x_2 \bmod N$ from $x_1$ and $x_2$ in the left branch, 
$x_3 x_4 \bmod N$ in the right branch, and $x_1 x_2 x_3 x_4 \bmod N$ at the root of the 
tree. Then invert the root of the tree. Finally, go down the tree, multiplying 
$1/(x_1 x_2 x_3 x_4)$ by the stored value $x_3 x_4$ to get $1/(x_1 x_2)$, and so on. 


A dual case is when there are several moduli but the number to invert is 
fixed. Say we want to compute $1/x \bmod N_1, \ldots, 1/x \bmod N_k$. We illustrate 
a possible algorithm in the case $k = 4$. First compute $N = N_1 \cdots N_k$ using 
a product tree like that in Figure 2.1. For example, first compute $N_1 N_2$ and 
$N_3 N_4$, then multiply both to get $N = (N_1 N_2)(N_3 N_4)$. Then compute $y = 
1/x \bmod N$, and go down the tree, while reducing the residue at each node. In 
our example, we compute $z = y \bmod (N_1 N_2)$ in the left branch, then $z \bmod 
N_1$ yields $1/x \bmod N_1$. An important difference between this algorithm and 
the algorithm illustrated in Figure 2.1 is that here the numbers grow while 
going up the tree. Thus, depending on the sizes of $x$ and the $N_i$, this algorithm 
might be of theoretical interest only. 


2.6 Modular exponentiation 


Modular exponentiation is the most time-consuming mathematical operation 
in several cryptographic algorithms. The well-known RSA public-key cryp- 
tosystem is based on the fact that computing 


$$c = a^e \bmod N \qquad (2.4)$$


is relatively easy, but recovering $a$ from $c$, $e$ and $N$ is difficult when $N$ has 
at least two (unknown) large prime factors. The discrete logarithm problem is 
similar: here $c$, $a$ and $N$ are given, and we look for $e$ satisfying Eqn. (2.4). In 
this case, the problem is difficult when $N$ has at least one large prime factor 
(for example, $N$ could be prime). The discrete logarithm problem is the basis 
of the El Gamal cryptosystem, and a closely related problem is the basis of the 
Diffie–Hellman key exchange protocol. 

When the exponent $e$ is fixed (or known to be small), an optimal sequence 
of squarings and multiplications might be computed in advance. This is related 
to the classical addition chain problem: What is the smallest chain of additions 
to reach the integer $e$, starting from 1? For example, if $e = 15$, a possible chain 
is 

$$1,\ 1 + 1 = 2,\ 1 + 2 = 3,\ 1 + 3 = 4,\ 3 + 4 = 7,\ 7 + 7 = 14,\ 1 + 14 = 15.$$

The length of a chain is defined to be the number of additions needed to compute 
it (the above chain has length 6). An addition chain readily translates to a 
multiplication chain 

$$a,\ a \cdot a = a^2,\ a \cdot a^2 = a^3,\ a \cdot a^3 = a^4,\ a^3 \cdot a^4 = a^7,\ a^7 \cdot a^7 = a^{14},\ a \cdot a^{14} = a^{15}.$$

A shorter chain for $e = 15$ is 

$$1,\ 1 + 1 = 2,\ 1 + 2 = 3,\ 2 + 3 = 5,\ 5 + 5 = 10,\ 5 + 10 = 15.$$

This chain is the shortest possible for $e = 15$, so we write $\sigma(15) = 5$, where in 
general $\sigma(e)$ denotes the length of the shortest addition chain for $e$. In the case 
where $e$ is small, and an addition chain of shortest length $\sigma(e)$ is known for $e$, 
computing $a^e \bmod N$ may be performed in $\sigma(e)$ modular multiplications. 

When $e$ is large and $(a, N) = 1$, then $e$ might be reduced modulo $\phi(N)$, 
where $\phi(N)$ is Euler’s totient function, i.e. the number of integers in $[1, N]$ 
which are relatively prime to $N$. This is because $a^{\phi(N)} \equiv 1 \bmod N$ whenever 
$(a, N) = 1$ (Euler’s theorem, which generalizes Fermat’s little theorem). 

Since $\phi(N)$ is a multiplicative function, it is easy to compute $\phi(N)$ if we 
know the prime factorization of $N$. For example 

$$\phi(1001) = \phi(7 \cdot 11 \cdot 13) = (7 - 1)(11 - 1)(13 - 1) = 720,$$

and $2009 \equiv 569 \bmod 720$, so $17^{2009} \equiv 17^{569} \bmod 1001$. 
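
This reduction is easy to check numerically. The sketch below assumes $N$ is given as a product of distinct primes (the helper name `euler_phi_squarefree` is hypothetical) and relies on Python's built-in three-argument `pow` for the modular powers.

```python
from math import prod


def euler_phi_squarefree(primes):
    """phi(N) for N a product of distinct primes (assumption: no repeated prime)."""
    return prod(p - 1 for p in primes)


phi = euler_phi_squarefree([7, 11, 13])
print(phi, 2009 % phi)                               # 720 569
print(pow(17, 2009, 1001) == pow(17, 569, 1001))     # True
```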

Assume now that $e$ is smaller than $\phi(N)$. Since a lower bound on the length 
$\sigma(e)$ of the addition chain for $e$ is $\lg e$, this yields a lower bound $(\lg e)M(n)$ 
for modular exponentiation, where $n$ is the size of $N$. When $e$ is of size $k$, a 
modular exponentiation costs $O(kM(n))$. For $k = n$, the cost $O(nM(n))$ of 
modular exponentiation is much more than the cost of operations considered in 
Chapter 1, with $O(M(n) \log n)$ for the more expensive ones there. The different 
algorithms presented in this section save only a constant factor compared 
to binary exponentiation (§2.6.1). 


REMARK: when $a$ fits in one word but $N$ does not, the shortest addition chain 
for $e$ might not be the best way to compute $a^e \bmod N$, since in this case 
computing $a \cdot a^j \bmod N$ is cheaper than computing $a^i \cdot a^j \bmod N$ for $i \ge 2$. 


2.6.1 Binary exponentiation 


A simple (and not far from optimal) algorithm for modular exponentiation is 
binary (modular) exponentiation. Two variants exist: left-to-right and right-to- 
left. We give the former in Algorithm LeftToRightBinaryExp and leave the 
latter as an exercise for the reader. 


Algorithm 2.12 LeftToRightBinaryExp 
Input: $a$, $e$, $N$ positive integers 
Output: $x = a^e \bmod N$ 
1: let $(e_\ell e_{\ell-1} \ldots e_1 e_0)$ be the binary representation of $e$, with $e_\ell = 1$ 
2: $x \leftarrow a$ 
3: for $i$ from $\ell - 1$ downto 0 do 
4:     $x \leftarrow x^2 \bmod N$ 
5:     if $e_i = 1$ then $x \leftarrow ax \bmod N$. 


Left-to-right binary exponentiation has two advantages over right-to-left 
exponentiation: 


• it requires only one auxiliary variable, instead of two for the right-to-left 
exponentiation: one to store successive values of $a^{2^i}$, and one to store the 
result; 

• in the case where $a$ is small, the multiplications $ax$ at step 5 always involve 
a small operand. 


If $e$ is a random integer of $\ell + 1$ bits, step 5 will be performed on average $\ell/2$ 
times, giving average cost $3\ell M(n)/2$. 

EXAMPLE: for the exponent $e = 3\,499\,211\,612$, which is 


$$(11\,010\,000\,100\,100\,011\,011\,101\,101\,011\,100)_2$$


in binary, Algorithm LeftToRightBinaryExp performs 31 squarings and 15 
multiplications (one for each 1-bit, except the most significant one). 
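
The operation counts in this example can be confirmed with a small Python sketch of Algorithm LeftToRightBinaryExp. The function name and the two counters are additions of this sketch; the test values $a = 3$ and $N = 1\,000\,003$ are arbitrary choices for illustration.

```python
def left_to_right_binary_exp(a, e, N):
    """Sketch of LeftToRightBinaryExp, also counting squarings and multiplications."""
    bits = bin(e)[2:]                 # e_l ... e_0, with leading bit e_l = 1
    x = a % N
    squarings = multiplications = 0   # illustrative counters
    for bit in bits[1:]:              # i from l-1 down to 0
        x = x * x % N                 # step 4
        squarings += 1
        if bit == "1":                # step 5
            x = a * x % N
            multiplications += 1
    return x, squarings, multiplications


x, s, m = left_to_right_binary_exp(3, 3499211612, 1000003)
print(s, m, x == pow(3, 3499211612, 1000003))   # 31 15 True
```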


2.6.2 Exponentiation with a larger base 


Compared to binary exponentiation, base $2^k$ exponentiation reduces the 
number of multiplications $ax \bmod N$ (Algorithm LeftToRightBinaryExp, 
step 5). The idea is to precompute small powers of $a \bmod N$: 


Algorithm 2.13 BaseKExp 
Input: $a$, $e$, $N$ positive integers 
Output: $x = a^e \bmod N$ 
1: precompute $t[i] := a^i \bmod N$ for $1 \le i < 2^k$ 
2: let $(e_\ell e_{\ell-1} \ldots e_1 e_0)$ be the base $2^k$ representation of $e$, with $e_\ell \ne 0$ 
3: $x \leftarrow t[e_\ell]$ 
4: for $i$ from $\ell - 1$ downto 0 do 
5:     $x \leftarrow x^{2^k} \bmod N$ 
6:     if $e_i \ne 0$ then $x \leftarrow t[e_i]x \bmod N$. 


The precomputation cost is $(2^k - 2)M(n)$, and if the digits $e_i$ are random 
and uniformly distributed in $\mathbb{Z} \cap [0, 2^k)$, then the modular multiplication at 
step 6 of BaseKExp is performed with probability $1 - 2^{-k}$. If $e$ has $n$ bits, the 
number of loops is about $n/k$. Ignoring the squares at step 5 (their total cost 
depends on $k\ell \approx n$ so is independent of $k$), the total expected cost in terms of 
multiplications modulo $N$ is 

$$2^k - 2 + n(1 - 2^{-k})/k.$$


For $k = 1$, this formula gives $n/2$; for $k = 2$, it gives $3n/8 + 2$, which is faster 
for $n > 16$; for $k = 3$, it gives $7n/24 + 6$, which is faster than the $k = 2$ 
formula for $n > 48$. When $n$ is large, the optimal value of $k$ satisfies $k^2 2^k \approx 
n/\ln 2$. A minor disadvantage of this algorithm is its memory usage, since 
$\Theta(2^k)$ precomputed entries have to be stored. This is not a serious problem if 
we choose the optimal value of $k$ (or a smaller value), because then the number 
of precomputed entries to be stored is $o(n)$. 


EXAMPLE: consider the exponent $e = 3\,499\,211\,612$. Algorithm BaseKExp 
performs 31 squarings independently of $k$, we therefore count multiplications 
only. For $k = 2$, we have $e = (3\,100\,210\,123\,231\,130)_4$: Algorithm BaseKExp 
performs two multiplications to precompute $a^2$ and $a^3$, and 11 multiplications 
for the non-zero digits of $e$ in base 4 (except for the leading digit), i.e. a total 
of 13. For $k = 3$, we have $e = (32\,044\,335\,534)_8$, and the algorithm performs 
six multiplications to precompute $a^2, a^3, \ldots, a^7$, and nine multiplications in 
step 6, i.e. a total of 15. 
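
The multiplication counts of this example can be reproduced with a Python sketch of Algorithm BaseKExp. The name `base_k_exp` and the counter `mults` are assumptions of this sketch; squarings are deliberately not counted, and `pow(x, 2**k, N)` stands for the $k$ successive squarings of step 5.

```python
def base_k_exp(a, e, N, k):
    """Sketch of BaseKExp; returns (a**e mod N, number of multiplications)."""
    mults = 0                          # illustrative counter, squarings excluded
    t = [1, a % N]                     # t[i] = a**i mod N
    for _ in range(2, 2**k):
        t.append(t[-1] * a % N)        # step 1: 2**k - 2 multiplications
        mults += 1
    digits = []                        # base 2**k digits, least significant first
    while e > 0:
        digits.append(e % 2**k)
        e //= 2**k
    digits.reverse()                   # now e_l ... e_0
    x = t[digits[0]]                   # step 3
    for d in digits[1:]:
        x = pow(x, 2**k, N)            # step 5: k squarings
        if d != 0:                     # step 6
            x = t[d] * x % N
            mults += 1
    return x, mults


for k in (2, 3):
    print(k, base_k_exp(3, 3499211612, 1000003, k)[1])   # 2 13, then 3 15
```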

The last example illustrates two facts. First, if some digits (here 6 and 7) do 
not appear in the base-$2^k$ representation of $e$, then we do not need to precompute 
the corresponding powers of $a$. Second, when a digit is even, say $e_i = 2$, 
instead of doing three squarings and multiplying by $a^2$, we could do two squarings, 
multiply by $a$, and perform a last squaring. These considerations lead to 
Algorithm BaseKExpOdd. 

The correctness of steps 7–9 follows from: 

$$x^{2^k} a^{2^m d} = (x^{2^{k-m}} a^d)^{2^m}.$$


Algorithm 2.14 BaseKExpOdd 
Input: $a$, $e$, $N$ positive integers 
Output: $x = a^e \bmod N$ 
1: precompute $a^2$ then $t[i] := a^i \bmod N$ for $i$ odd, $1 \le i < 2^k$ 
2: let $(e_\ell e_{\ell-1} \ldots e_1 e_0)$ be the base $2^k$ representation of $e$, with $e_\ell \ne 0$ 
3: write $e_\ell = 2^m d$ with $d$ odd 
4: $x \leftarrow t[d]$, $x \leftarrow x^{2^m} \bmod N$ 
5: for $i$ from $\ell - 1$ downto 0 do 
6:     write $e_i = 2^m d$ with $d$ odd (if $e_i = 0$ then $m = d = 0$) 
7:     $x \leftarrow x^{2^{k-m}} \bmod N$ 
8:     if $e_i \ne 0$ then $x \leftarrow t[d]x \bmod N$ 
9:     $x \leftarrow x^{2^m} \bmod N$. 


On the previous example, with $k = 3$, this algorithm performs only four 
multiplications in step 1 (to precompute $a^2$ then $a^3, a^5, a^7$), then nine 
multiplications in step 8. 

2.6.3 Sliding window and redundant representation 


The “sliding window” algorithm is a straightforward generalization of 
Algorithm BaseKExpOdd. Instead of cutting the exponent into fixed parts 
of $k$ bits each, the idea is to divide it into windows, where two adjacent windows 
might be separated by a block of zero or more 0-bits. The decomposition 
starts from the least significant bits. For example, with $e = 3\,499\,211\,612$, or 
in binary 

      1    101   00   001   001   00   011   011   101   101   0   111   00 
     e_8   e_7        e_6   e_5        e_4   e_3   e_2   e_1       e_0 


Here there are nine windows (indicated by $e_8, \ldots, e_0$ above) and we perform 
only eight multiplications, an improvement of one multiplication over Algorithm 
BaseKExpOdd. On average, the sliding window base $2^k$ algorithm leads 
to about $n/(k + 1)$ windows instead of $n/k$ with fixed windows. 
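
One way to realize this decomposition is sketched below in Python. The text does not spell out the windowing rule, so the greedy rule used here (scan from the least significant bit, skip 0-bits, and open a window of $k$ bits at each 1-bit) is an assumption; it does reproduce the nine windows shown above for $k = 3$. The names `sliding_windows` and `sliding_window_exp` are hypothetical.

```python
def sliding_windows(e, k):
    """Greedy windows of e, least significant first, as (bit position, odd value)."""
    windows, pos = [], 0
    while e >> pos:
        if ((e >> pos) & 1) == 0:
            pos += 1                                   # skip a separating 0-bit
        else:
            windows.append((pos, (e >> pos) & (2**k - 1)))
            pos += k                                   # window of k bits, odd value
    return windows


def sliding_window_exp(a, e, N, k):
    """Returns (a**e mod N, multiplications by table entries); e must be positive."""
    windows = sliding_windows(e, k)
    a2 = a * a % N
    t = {1: a % N}
    for d in range(3, 2**k, 2):                        # odd powers a, a^3, a^5, ...
        t[d] = t[d - 2] * a2 % N
    prev, top = windows[-1]
    x, mults = t[top], 0
    for pos, val in reversed(windows[:-1]):
        x = pow(x, 2**(prev - pos), N) * t[val] % N    # squarings, then one multiply
        mults += 1
        prev = pos
    return pow(x, 2**prev, N), mults                   # trailing 0-bits: squarings only


e = 3499211612
print(len(sliding_windows(e, 3)), sliding_window_exp(3, e, 1000003, 3)[1])   # 9 8
```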

Another improvement may be feasible when division is feasible (and cheap) 
in the underlying group. For example, if we encounter three consecutive ones, 
say 111, in the binary representation of $e$, we may replace some bits by $-1$, 
denoted by $\bar{1}$, as in $100\bar{1}$. We have thus replaced three multiplications by one 
multiplication and one division, in other words $x^7 = x^8 \cdot x^{-1}$. For our running 
example, this gives 

$$e = 11010000\ 100\ 100\ 100\ \bar{1}00\ 0\bar{1}0\ 0\bar{1}0\ \bar{1}00\ \bar{1}00,$$

which has only ten non-zero digits, apart from the leading one, instead of 
15 with bits 0 and 1 only. The redundant representation with bits $\{0, 1, \bar{1}\}$ is 
called the Booth representation. It is a special case of the Avizienis signed-digit 
redundant representation. Signed-digit representations exist in any base. 

For simplicity, we have not distinguished between the cost of multiplica- 
tion and the cost of squaring (when the two operands in the multiplication are 
known to be equal), but this distinction is significant in some applications (e.g. 
elliptic curve cryptography). Note that, when the underlying group operation 
is denoted by addition rather than multiplication, as is usually the case for 
abelian groups (such as groups defined over elliptic curves), then the discus- 
sion above applies with “multiplication” replaced by “addition”, “division” by 
“subtraction”, and “squaring” by “doubling”. 


2.7 Chinese remainder theorem 


In applications where integer or rational results are expected, it is often worth- 
while to use a “residue number system” (as in §2.1.3) and perform all compu- 
tations modulo several small primes (or pairwise coprime integers). The final 
result can then be recovered via the Chinese remainder theorem (CRT). For 
such applications, it is important to have fast conversion routines from integer 
to modular representation, and vice versa. 

The integer to modular conversion problem is the following: given an integer 
$x$, and several pairwise coprime moduli $m_i$, $1 \le i \le k$, how do we efficiently 
compute $x_i = x \bmod m_i$, for $1 \le i \le k$? This is the remainder tree problem of 
Algorithm IntegerToRNS, which is also discussed in §2.5.1 and Exercise 1.35. 


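Algorithm 2.15 IntegerToRNS 
Input: integer $x$, moduli $m_1, m_2, \ldots, m_k$ pairwise coprime, $k \ge 1$ 
Output: $x_i = x \bmod m_i$ for $1 \le i \le k$ 
1: if $k \le 2$ then 
2:     return $x_1 = x \bmod m_1, \ldots, x_k = x \bmod m_k$ 
3: $\ell \leftarrow \lfloor k/2 \rfloor$ 
4: $M_1 \leftarrow m_1 m_2 \cdots m_\ell$, $M_2 \leftarrow m_{\ell+1} \cdots m_k$    ▷ might be precomputed 
5: $x_1, \ldots, x_\ell \leftarrow$ IntegerToRNS($x \bmod M_1, m_1, \ldots, m_\ell$) 
6: $x_{\ell+1}, \ldots, x_k \leftarrow$ IntegerToRNS($x \bmod M_2, m_{\ell+1}, \ldots, m_k$). 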


If all moduli $m_i$ have the same size, and if the size $n$ of $x$ is comparable to 
that of the product $m_1 m_2 \cdots m_k$, the cost $T(n)$ of Algorithm IntegerToRNS 
satisfies the recurrence $T(n) = 2D(n/2) + 2T(n/2)$, which yields $T(n) = 
O(M(n) \log n)$. Such a conversion is therefore more expensive than a multiplication 
or division, and is comparable in complexity terms to a base conversion 
or a gcd. 
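
A direct Python transcription of Algorithm IntegerToRNS might look as follows. The function name `integer_to_rns` is an assumption, and the sub-products $M_1$, $M_2$ are recomputed at every call for simplicity, whereas the algorithm notes that they might be precomputed.

```python
from math import prod


def integer_to_rns(x, moduli):
    """Sketch of IntegerToRNS: returns [x mod m for m in moduli] via a remainder tree."""
    k = len(moduli)
    if k <= 2:                                    # steps 1-2: base case
        return [x % m for m in moduli]
    l = k // 2                                    # step 3
    left, right = moduli[:l], moduli[l:]
    M1, M2 = prod(left), prod(right)              # step 4 (might be precomputed)
    return integer_to_rns(x % M1, left) + integer_to_rns(x % M2, right)


moduli = [11, 13, 17, 19, 23]
print(integer_to_rns(123456, moduli) == [123456 % m for m in moduli])   # True
```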

The converse CRT reconstruction problem is the following: given the $x_i$, 
how do we efficiently reconstruct the unique integer $x$, $0 \le x < m_1 m_2 \cdots m_k$, 
such that $x \equiv x_i \bmod m_i$, for $1 \le i \le k$? Algorithm RNSToInteger performs 
that conversion, where the values $u$, $v$ at step 7 might be precomputed if several 
conversions are made with the same moduli, and step 11 ensures that the final 
result $x$ lies in the interval $[0, M_1 M_2)$. 


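Algorithm 2.16 RNSToInteger 
Input: residues $x_i$, $0 \le x_i < m_i$ for $1 \le i \le k$, $m_i$ pairwise coprime 
Output: $0 \le x < m_1 m_2 \cdots m_k$ with $x \equiv x_i \bmod m_i$ 
1: if $k = 1$ then 
2:     return $x_1$ 
3: $\ell \leftarrow \lfloor k/2 \rfloor$ 
4: $M_1 \leftarrow m_1 m_2 \cdots m_\ell$, $M_2 \leftarrow m_{\ell+1} \cdots m_k$    ▷ might be precomputed 
5: $X_1 \leftarrow$ RNSToInteger($[x_1, \ldots, x_\ell], [m_1, \ldots, m_\ell]$) 
6: $X_2 \leftarrow$ RNSToInteger($[x_{\ell+1}, \ldots, x_k], [m_{\ell+1}, \ldots, m_k]$) 
7: compute $u$, $v$ such that $uM_1 + vM_2 = 1$    ▷ might be precomputed 
8: $\lambda_1 \leftarrow uX_2 \bmod M_2$, $\lambda_2 \leftarrow vX_1 \bmod M_1$ 
9: $x \leftarrow \lambda_1 M_1 + \lambda_2 M_2$ 
10: if $x \ge M_1 M_2$ then 
11:     $x \leftarrow x - M_1 M_2$. 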


To see that Algorithm RNSToInteger is correct, consider an integer $i$, $1 \le 
i \le k$, and show that $x \equiv x_i \bmod m_i$. If $k = 1$, it is trivial. Assume $k \ge 2$, 
and without loss of generality $1 \le i \le \ell$. Since $M_1$ is a multiple of $m_i$, we 
have $x \bmod m_i = (x \bmod M_1) \bmod m_i$, where 

$$x \bmod M_1 = \lambda_2 M_2 \bmod M_1 = vX_1M_2 \bmod M_1 = X_1 \bmod M_1,$$

and the result follows from the induction hypothesis that $X_1 \equiv x_i \bmod m_i$. 
Like IntegerToRNS, Algorithm RNSToInteger costs $O(M(n) \log n)$ for 
$M = m_1 m_2 \cdots m_k$ of size $n$, assuming that the $m_i$ are of equal sizes. 
The CRT reconstruction problem is analogous to the Lagrange polynomial 
interpolation problem: find a polynomial of minimal degree interpolating given 
values x; at k points m,. 
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A “flat” variant of the explicit Chinese remainder reconstruction is the 
following, taking for example k = 3 


LS AL + A2x2 + A323, 


where A; = 1 mod m,, and A; = 0 mod m, for 7 4 7. In other words, A; is 
the reconstruction of x; = 0,...,%;-1 = 0,4; = 1,vji41 = 0,..., 0% = 0. 
For example, with m; = 11, mz = 13 and m3 = 17, we get 


x = 22147, + 1496x2 + 71523. 


To reconstruct the integer corresponding to 24 2, r9 3, ©3 4, we 
get v = 221-2+ 1496-3+4 715-4 = 7790, which after reduction modulo 
11-13-17 = 2431 gives 497. 
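
The weights of the flat variant can be recomputed with a few lines of Python. This is a small sketch under the assumption that each $\lambda_i$ is taken as the cofactor $M/m_i$ times its inverse modulo $m_i$ (one natural choice satisfying the two congruences above); the helper name `crt_weights` is hypothetical.

```python
from math import prod

def crt_weights(moduli):
    """Return weights lam[i] with lam[i] = 1 mod m_i and lam[i] = 0 mod m_j for j != i."""
    M = prod(moduli)
    weights = []
    for m in moduli:
        cofactor = M // m                  # product of all the other moduli
        # pow(cofactor, -1, m) is the inverse modulo m (Python 3.8 or later)
        weights.append(cofactor * pow(cofactor, -1, m))
    return weights

moduli = [11, 13, 17]
lam = crt_weights(moduli)
x = sum(l * r for l, r in zip(lam, [2, 3, 4])) % prod(moduli)
```

With these moduli the weights come out as 221, 1496 and 715, and `x` is the value 497 obtained in the text.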




2.8 Exercises 


Exercise 2.1 In §2.1.3 we considered the representation of non-negative inte- 
gers using a residue number system. Show that a residue number system can 
also be used to represent signed integers, provided their absolute values are not 
too large. (Specifically, if relatively prime moduli $m_1, m_2, \ldots, m_k$ are used,
and $B = m_1 m_2 \cdots m_k$, the integers $x$ should satisfy $|x| < B/2$.)


Exercise 2.2 Suppose two non-negative integers $x$ and $y$ are represented by
their residues modulo a set of relatively prime moduli $m_1, m_2, \ldots, m_k$ as in
§2.1.3. Consider the comparison problem: is $x < y$? Is it necessary to convert
x and y back to a standard (non-CRT) representation in order to answer this 
question? Similarly, if a signed integer x is represented as in Exercise 2.1, 
consider the sign detection problem: is x < 0? 


Exercise 2.3 Consider the use of redundant moduli in the Chinese remainder
representation. In other words, using the notation of Exercise 2.2, consider the 
case that x could be reconstructed without using all the residues. Show that this 
could be useful for error detection (and possibly error correction) if arithmetic 
operations are performed on unreliable hardware. 


Exercise 2.4 Consider the two complexity bounds $O(M(d \log(Nd)))$ and
$O(M(d) M(\log N))$ given at the end of §2.1.5. Compare the bounds in three
cases: (a) $d \ll N$; (b) $d \sim N$; (c) $d \gg N$. Assume two subcases for the
multiplication algorithm: (i) $M(n) = O(n^2)$; (ii) $M(n) = O(n \log n)$. (For the
sake of simplicity, ignore any log log factors.)


Exercise 2.5 Show that, if a symmetric representation in $[-N/2, N/2)$ is used
in Algorithm ModularAdd (§2.2), then the probability that we need to add or
subtract $N$ is $1/4$ if $N$ is even, and $(1 - 1/N^2)/4$ if $N$ is odd (assuming in
both cases that $a$ and $b$ are uniformly distributed).


Exercise 2.6 Write down the complexity of the Montgomery-Svoboda algorithm
(§2.4.2, page 61) for $k$ steps. For $k = 3$, use van der Hoeven’s relaxed
Karatsuba multiplication [124] to save one $M(n/3)$ product.


Exercise 2.7 Assume you have an FFT algorithm computing products modulo 
$2^n + 1$. Prove that, with some preconditioning, you can perform a division with
remainder of a 2n-bit integer by an n-bit integer as fast as 1.5 multiplications 
of n bits by n bits. 


Exercise 2.8 Assume you know $p(x) \bmod (x^{n_1} - 1)$ and $p(x) \bmod (x^{n_2} - 1)$,
where $p(x) \in F[x]$ has degree $n - 1$, and $n_1 > n_2$, and $F$ is a field. Up to which
value of $n$ can you uniquely reconstruct $p$? Design a corresponding algorithm.


Exercise 2.9 Consider the problem of computing the Fourier transform of a
vector $a = [a_0, a_1, \ldots, a_{K-1}]$, defined in Eqn. (2.1), when the size $K$ is not a
power of two. For example, $K$ might be an odd prime or an odd prime power.
Can you find an algorithm to do this in $O(K \log K)$ operations?


Exercise 2.10 Consider the problem of computing the cyclic convolution of
two $K$-vectors, where $K$ is not a power of two. (For the definition, with $K$
replaced by $N$, see §3.3.1.) Show that the cyclic convolution can be computed
using FFTs on $2^\lambda$ points for some suitable $\lambda$, or by using DFTs on $K$ points
(see Exercise 2.9). Which method is better?


Exercise 2.11 Devise a parallel version of Algorithm MultipleInversion as 
outlined in §2.5.1. Analyse its time and space complexity. Try to minimize the
number of parallel processors required while achieving a parallel time com- 
plexity of O(log k). 


Exercise 2.12 Analyse the complexity of the algorithm outlined at the end
of §2.5.1 to compute $1/x \bmod N_1, \ldots, 1/x \bmod N_k$, when all the $N_i$ have
size $n$, and $x$ has size $\ell$. For which values of $n$, $\ell$ is it faster than the naive
algorithm which computes all modular inverses separately? [Assume $M(n)$ is
quasi-linear, and neglect multiplicative constants.]


Exercise 2.13 Write a RightToLeftBinaryExp algorithm and compare it with
Algorithm LeftToRightBinaryExp of §2.6.1.


Exercise 2.14 Investigate heuristic algorithms for obtaining close-to-optimal
addition (or multiplication) chains when the cost of a general addition $a + b$
(or multiplication $a \cdot b$) is $\lambda$ times the cost of duplication $a + a$ (or squaring
$a \cdot a$), and $\lambda$ is some fixed positive constant. (This is a reasonable model for
modular exponentiation, because multiplication mod $N$ is generally more
expensive than squaring mod $N$. It is also a reasonable model for operations in
groups defined by elliptic curves, since in this case the formulae for addition
and duplication are usually different and have different costs.)


2.9 Notes and references 


Several number-theoretic algorithms make heavy use of modular arithmetic, in 
particular integer factorization algorithms (for example: Pollard’s $\rho$ algorithm
and the elliptic curve method). 

Another important application of modular arithmetic in computer algebra
is computing the roots of a univariate polynomial over a finite field, which
requires efficient arithmetic over $\mathbb{F}_p[x]$. See for example the excellent book
“MCA” by von zur Gathen and Gerhard [100].

We say in §2.1.3 that residue number systems can only be used when $N$
factors into $N_1 N_2 \ldots$; this is not quite true, since Bernstein and Sorenson show
in [24] how to perform modular arithmetic using a residue number system.

For notes on the Kronecker-Schönhage trick, see §1.9.

Barrett’s algorithm is described in [14], which also mentions the idea of
using two short products. The original description of Montgomery’s REDC
algorithm is [169]. It is now widely used in several applications. However, only
a few authors considered using a reduction factor which is not of the form
$\beta^n$, among them McLaughlin [160] and Mihailescu [164]. The Montgomery-
Svoboda algorithm (§2.4.2) is also called “Montgomery tail tayloring” by
Hars [113], who attributes Svoboda’s algorithm (more precisely its variant
with the most significant word being $\beta - 1$ instead of $\beta$) to Quisquater. The
folding optimization of REDC described in §2.4.2 (Subquadratic Montgomery
Reduction) is an LSB-extension of the algorithm described in the context of
Barrett’s algorithm by Hasenplaugh, Gaubatz, and Gopal [118]. Amongst the
algorithms not covered in this book, we mention the “bipartite modular
multiplication” of Kaihara and Takagi [134], which involves performing both MSB-
and LSB-division in parallel.

The description of McLaughlin’s algorithm in §2.4.3 follows [160,
Variation 2]; McLaughlin’s algorithm was reformulated in a polynomial context by
Mihailescu [164].

Many authors have proposed FFT algorithms, or improvements of such
algorithms, and applications such as fast computation of convolutions. Some
references are Aho, Hopcroft, and Ullman [3]; Nussbaumer [176]; Borodin
and Munro, who describe the polynomial approach; Van Loan [222] for
the linear algebra approach; and Pollard [185] for the FFT over finite fields.
Rader [187] considered the case where the number of data points is a prime,
and Winograd [230] generalized Rader’s algorithm to prime powers. Bluestein’s
algorithm [30] is also applicable in these cases. In Bernstein [22, §23] the
reader will find some historical remarks and several nice applications of the
FFT.

The Schönhage-Strassen algorithm first appeared in [199]. Recently,
Fürer [98] has proposed an integer multiplication algorithm that is
asymptotically faster than the Schönhage-Strassen algorithm. Fürer’s algorithm almost
achieves the conjectured best possible $O(n \log n)$ running time.

Concerning special moduli, Percival considers the case $N = a \pm b$,
where both $a$ and $b$ are highly composite; this is a generalization of the case
$N = \beta^n \pm 1$. The pseudo-Mersenne primes of §2.4.4 are recommended in
the National Institute of Standards and Technology (NIST) Digital Signature
Standard. See also the book by Hankerson, Menezes, and Vanstone [110].

Algorithm MultipleInversion — also known as “batch inversion” — is due 
to Montgomery [170]. The application of Barrett’s algorithm for an implicitly 
invariant divisor was suggested by Granlund. 

Modular exponentiation and cryptographic algorithms are described in much 
detail in the book by Menezes, van Oorschot, and Vanstone [161, Chapter 14]. 
A detailed description of the best theoretical algorithms, with references, can 
be found in Bernstein [18]. When both the modulus and base are invariant, 
modular exponentiation with $k$-bit exponent and $n$-bit modulus can be
performed in time $O((k/\log k) M(n))$, after a precomputation of $O(k/\log k)$
powers in time $O(k M(n))$. Take for example $b = 2^{k/t}$ in Note 14.112 and
Algorithm 14.109 of [161], with $t \log t \approx k$, where the powers $a^{b^j} \bmod N$
for $0 \le j < t$ are precomputed. An algorithm of same complexity using a
DBNS (Double-Base Number System) was proposed by Dimitrov, Jullien, and
Miller [86], however with a larger table of $O(k^2)$ precomputed powers.

Original papers on Booth recoding, SRT division, etc., are reprinted in the
book by Swartzlander [212].

A quadratic algorithm for CRT reconstruction is discussed in Cohen [73];
Möller gives some improvements in the case of a small number of small moduli
known in advance [167]. Algorithm IntegerToRNS can be found in Borodin
and Moenck [34]. The explicit Chinese remainder theorem and its applications
to modular exponentiation are discussed by Bernstein and Sorenson in [24].


Floating-point arithmetic


This chapter discusses the basic operations (addition, subtraction,
multiplication, division, square root, conversion) on arbitrary
precision floating-point numbers, as Chapter 1 does for arbitrary
precision integers. More advanced functions such as elementary
and special functions are covered in Chapter 4. This
chapter largely follows the IEEE 754 standard, and extends it in
a natural way to arbitrary precision; deviations from IEEE 754
are explicitly mentioned. By default, IEEE 754 refers to the 2008
revision, known as IEEE 754-2008; we write IEEE 754-1985
when we explicitly refer to the 1985 initial standard. Topics
not discussed here include: hardware implementations, fixed-precision
implementations, special representations.


3.1 Representation 


The classical non-redundant representation of a floating-point number $x$ in
radix $\beta > 1$ is the following (other representations are discussed in §3.8):


$x = (-1)^s \cdot m \cdot \beta^e$,    (3.1)


where $(-1)^s$, $s \in \{0, 1\}$, is the sign, $m \ge 0$ is the significand, and the integer
$e$ is the exponent of $x$. In addition, a positive integer $n$ defines the precision of
$x$, which means that the significand $m$ contains at most $n$ significant digits in
radix $\beta$.

An important special case is $m = 0$ representing zero. In this case, the sign
$s$ and exponent $e$ are irrelevant and may be used to encode other information
(see for example §3.1.3).

For $m \ne 0$, several semantics are possible; the most common ones are:

• $\beta^{-1} \le m < 1$, then $\beta^{e-1} \le |x| < \beta^e$. In this case, $m$ is an integer multiple
  of $\beta^{-n}$. We say that the unit in the last place of $x$ is $\beta^{e-n}$, and we write
  $\mathrm{ulp}(x) = \beta^{e-n}$. For example, $x = 3.1416$ with radix $\beta = 10$ is encoded
  by $m = 0.31416$ and $e = 1$. This is the convention that we will use in this
  chapter.

• $1 \le m < \beta$, then $\beta^e \le |x| < \beta^{e+1}$, and $\mathrm{ulp}(x) = \beta^{e+1-n}$. With radix ten
  the number $x = 3.1416$ is encoded by $m = 3.1416$ and $e = 0$. This is the
  convention adopted in the IEEE 754 standard.

• We can also use an integer significand $\beta^{n-1} \le m < \beta^n$, then
  $\beta^{e+n-1} \le |x| < \beta^{e+n}$, and $\mathrm{ulp}(x) = \beta^e$. With radix ten the number
  $x = 3.1416$ is encoded by $m = 31416$ and $e = -4$.


Note that in the above three cases, there is only one possible representation of 
a non-zero floating-point number: we have a canonical representation. In some 
applications, it is useful to relax the lower bound on non-zero m, which in the 
three cases above gives respectively $0 < m < 1$, $0 < m < \beta$, and $0 < m < \beta^n$,
with $m$ an integer multiple of $\beta^{-n}$, $\beta^{1-n}$, and 1 respectively. In this
case, there is no longer a canonical representation. For example, with an integer 
significand and a precision of five digits, the number 3.1400 might be encoded 
by (m = 31400, e = —4), (m = 03140, e = —3), or (m = 00314, e = —2). 
This non-canonical representation has the drawback that the most significant 
non-zero digit of the significand is not known in advance. The unique encoding 
with a non-zero most significant digit, i.e. (m = 31400, e = —4) here, is called 
the normalized — or simply normal — encoding. 
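
The three significand conventions differ only by a shift of the exponent. The following Python sketch makes this concrete for the running example 3.1416; it uses exact fractions, and the function name `significand_conventions` and its argument order are hypothetical choices made for this illustration.

```python
from fractions import Fraction

def significand_conventions(m_int, e_int, n, beta=10):
    """Convert a normalized integer-significand encoding (m_int, e_int) of
    precision n into the three (m, e) pairs discussed above."""
    assert beta ** (n - 1) <= m_int < beta ** n, "significand must be normalized"
    frac = (Fraction(m_int, beta ** n), e_int + n)             # 1/beta <= m < 1
    ieee = (Fraction(m_int, beta ** (n - 1)), e_int + n - 1)   # 1 <= m < beta
    integer = (Fraction(m_int), e_int)                         # beta^(n-1) <= m < beta^n
    return frac, ieee, integer

frac, ieee, integer = significand_conventions(31416, -4, 5)
# frac    is (0.31416, 1)  : the convention used in this chapter
# ieee    is (3.1416, 0)   : the IEEE 754 convention
# integer is (31416, -4)   : the integer-significand convention
```

All three pairs describe the same value $m \cdot \beta^e$; only the position of the radix point, and hence the exponent, changes.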

The significand is also sometimes called the mantissa or fraction. The above 
examples demonstrate that the different significand semantics correspond to 
different positions of the decimal (or radix $\beta$) point, or equivalently to different
biases of the exponent. We assume in this chapter that both the radix $\beta$ and the
significand semantics are implicit for a given implementation, and thus are not 
physically encoded. 

The words “base” and “radix” have similar meanings. For clarity, we reserve
“radix” for the constant $\beta$ in a floating-point representation, such as (3.1). The
significand m and exponent e might be stored in a different base, as discussed 
below. 


3.1.1 Radix choice 


Most floating-point implementations use radix $\beta = 2$ or a power of two,
because this is convenient and efficient on binary computers. For a radix $\beta$
which is not a power of 2, two choices are possible:

• Store the significand in base $\beta$, or more generally in base $\beta^k$ for an integer
  $k \ge 1$. Each digit in base $\beta^k$ requires $\lceil k \lg \beta \rceil$ bits. With such a choice,
  individual digits can be accessed easily. With $\beta = 10$ and $k = 1$, this is the
  “Binary Coded Decimal” or BCD encoding: each decimal digit is represented
  by four bits, with a memory loss of about 17% (since $\lg(10)/4 \approx 0.83$). A
  more compact choice is radix $10^3$, where three decimal digits are stored in
  ten bits, instead of in 12 bits with the BCD format. This yields a memory
  loss of only 0.34% (since $\lg(1000)/10 \approx 0.9966$).

• Store the significand in binary. This idea is used in Intel’s Binary-Integer
  Decimal (BID) encoding, and in one of the two decimal encodings in IEEE
  754-2008. Individual digits can not be accessed directly, but we can use
  efficient binary hardware or software to perform operations on the significand.
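
The memory-loss figures quoted for the first choice follow from a one-line computation, sketched here in Python. The helper name `memory_loss` is hypothetical; it simply compares the information content $k \lg \beta$ of a base-$\beta^k$ digit with the $\lceil k \lg \beta \rceil$ bits used to store it.

```python
from math import ceil, log2

def memory_loss(beta, k):
    """Fraction of storage wasted when one digit in base beta**k is stored
    in ceil(k * lg(beta)) bits."""
    bits = ceil(k * log2(beta))
    return 1 - k * log2(beta) / bits

bcd_loss = memory_loss(10, 1)      # one decimal digit in 4 bits, about 17%
dense_loss = memory_loss(10, 3)    # three decimal digits in 10 bits, about 0.34%
```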


A drawback of the binary encoding is that, during the addition of two arbitrary-
precision numbers, it is not easy to detect if the significand exceeds the
maximum value $\beta^n - 1$ (when considered as an integer) and thus if rounding is
required. Either $\beta^n$ is precomputed, which is only realistic if all computations
involve the same precision $n$, or it is computed on the fly, which might result
in increased complexity (see Chapter 1 and §2.6.1).


3.1.2 Exponent range


In principle, we might consider an unbounded exponent. In other words, the 
exponent e might be encoded by an arbitrary-precision integer (see Chapter 1). 
This would have the great advantage that no underflow or overflow could occur 
(see below). However, in most applications, an exponent encoded in 32 bits is 
more than enough: this enables us to represent values up to about $10^{646\,456\,993}$
for $\beta = 2$. A result exceeding this value most probably corresponds to an error
in the algorithm or the implementation. Using arbitrary-precision integers for 
the exponent induces an extra overhead that slows down the implementation in 
the average case, and it usually requires more memory to store each number. 
Thus, in practice the exponent nearly always has a limited range
$e_{\min} \le e \le e_{\max}$. We say that a floating-point number is representable if it can be
represented in the form $(-1)^s \cdot m \cdot \beta^e$ with $e_{\min} \le e \le e_{\max}$. The set of
representable numbers clearly depends on the significand semantics. For the
convention we use here, i.e. $\beta^{-1} \le m < 1$, the smallest positive representable
floating-point number is $\beta^{e_{\min}-1}$, and the largest one is $\beta^{e_{\max}}(1 - \beta^{-n})$.
Other conventions for the significand yield different exponent ranges. For
example, the double-precision format (called binary64 in IEEE 754-2008)
has $e_{\min} = -1022$, $e_{\max} = 1023$ for a significand in $[1, 2)$; this corresponds to
$e_{\min} = -1021$, $e_{\max} = 1024$ for a significand in $[1/2, 1)$, and $e_{\min} = -1074$,
$e_{\max} = 971$ for an integer significand in $[2^{52}, 2^{53})$.
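
These three exponent ranges can be checked against the parameters that Python exposes for its `float` type. The sketch below assumes that `float` is an IEEE 754 binary64 number, which holds on common platforms but is not guaranteed by the language; `sys.float_info` reports exponents for a significand in $[1/2, 1)$, the convention of this chapter.

```python
import sys
from math import frexp

info = sys.float_info                                   # assumed to describe binary64

frac_range = (info.min_exp, info.max_exp)               # significand in [1/2, 1)
ieee_range = (info.min_exp - 1, info.max_exp - 1)       # significand in [1, 2)
int_range = (info.min_exp - info.mant_dig,              # integer significand
             info.max_exp - info.mant_dig)              # in [2**52, 2**53)

# frexp returns (m, e) with 1/2 <= m < 1, so it follows the first convention.
m_small, e_small = frexp(info.min)                      # smallest positive normal number
m_large, e_large = frexp(info.max)                      # largest finite number
```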


3.1.3 Special values 


With a bounded exponent range, if we want a complete arithmetic, we need 
some special values to represent very large and very small values. Very small 
values are naturally flushed to zero, which is a special number in the sense that 
its significand is m = 0, which is not normalized. For very large values, it 
is natural to introduce two special values $-\infty$ and $+\infty$, which encode large
non-representable values. Since we have two infinities, it is natural to have two
zeros $-0$ and $+0$, for example $1/(-\infty) = -0$ and $1/(+\infty) = +0$. This is the
IEEE 754 choice. Another possibility would be to have only one infinity and
one zero 0, forgetting the sign in both cases.

An additional special value is Not a Number (NaN), which either represents
an uninitialized value, or is the result of an invalid operation like $\sqrt{-1}$ or
$(+\infty) - (+\infty)$. Some implementations distinguish between different kinds of
NaN, in particular IEEE 754 defines signaling and quiet NaNs. 


3.1.4 Subnormal numbers 


Subnormal numbers are required by the IEEE 754 standard, to allow what is 
called gradual underflow between the smallest (in absolute value) non-zero 
normalized numbers and zero. We first explain what subnormal numbers are; 
then we will see why they are not necessary in arbitrary precision. 

Assume we have an integer significand in $[\beta^{n-1}, \beta^n)$, where $n$ is the
precision, and an exponent in $[e_{\min}, e_{\max}]$. Write $\eta = \beta^{e_{\min}}$. The two smallest
positive normalized numbers are $x = \beta^{n-1}\eta$ and $y = (\beta^{n-1} + 1)\eta$. The
difference $y - x$ equals $\eta$, which is tiny compared to $x$. In particular, $y - x$
can not be represented exactly as a normalized number (assuming $\beta^{n-1} > 1$)
and will be rounded to zero in “rounding to nearest” mode (§3.1.9). This has
the unfortunate consequence that instructions such as 


    if (y != x) then
        z = 1.0/(y - x);


will produce a “division by zero” error when executing 1.0/(y - x).
Subnormal numbers solve this problem. The idea is to relax the condition
$\beta^{n-1} \le m$ for the exponent $e_{\min}$. In other words, we include all numbers
of the form $m \cdot \beta^{e_{\min}}$ for $1 \le m < \beta^{n-1}$ in the set of valid floating-point
numbers. We could also permit $m = 0$, and then zero would be a subnormal
number, but we continue to regard zero as a special case.

Subnormal numbers are all positive integer multiples of $\pm\eta$, with a
multiplier $m$, $1 \le m < \beta^{n-1}$. The difference between $x = \beta^{n-1}\eta$ and
$y = (\beta^{n-1} + 1)\eta$ is now representable, since it equals $\eta$, the smallest positive
subnormal number. More generally, all floating-point numbers are multiples of
$\eta$, likewise for their sum or difference (in other words, operations in the
subnormal domain correspond to fixed-point arithmetic). If the sum or difference
is non-zero, it has magnitude at least $\eta$, and thus can not be rounded to zero.
Therefore, the “division by zero” problem mentioned above does not occur 
with subnormal numbers. 
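
The effect of gradual underflow can be observed directly in binary64 arithmetic. The following Python sketch assumes that `float` is binary64 and that the platform does not flush subnormals to zero (both true for a default CPython build on common hardware); the variable names are chosen for this illustration only.

```python
import sys
from math import ldexp

x = sys.float_info.min       # smallest positive normal number, 2**-1022
eta = ldexp(1.0, -1074)      # smallest positive subnormal number, 2**-1074
y = x + eta                  # next floating-point number after x (exact sum)

diff = y - x                 # a subnormal number, not zero
```

Here `y != x` holds and `diff` equals `eta`, so a test such as `if y != x` does guard against a zero divisor, as described above.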

In the IEEE 754 double-precision format (called binary64 in IEEE 754-2008)
the smallest positive normal number is $2^{-1022}$, and the smallest positive
subnormal number is $2^{-1074}$. In arbitrary precision, subnormal numbers
seldom occur, since usually the exponent range is huge compared to the expected
exponents in a given application. Thus, the only reason for implementing
subnormal numbers in arbitrary precision is to provide an extension of IEEE 754
arithmetic. Of course, if the exponent range is unbounded, then there is
absolutely no need for subnormal numbers, because any non-zero floating-point
number can be normalized.


3.1.5 Encoding 


The encoding of a floating-point number $x = (-1)^s \cdot m \cdot \beta^e$ is the way the
values $s$, $m$, and $e$ are stored in the computer. Remember that $\beta$ is implicit, i.e.
is considered fixed for a given implementation; as a consequence, we do not
consider here mixed radix operations involving numbers with different radices
$\beta$ and $\beta'$.

We have already seen that there are several ways to encode the significand
$m$ when $\beta$ is not a power of two, in base $\beta^k$ or in binary. For normal numbers
in radix 2, i.e. $2^{n-1} \le m < 2^n$, the leading bit of the significand is necessarily
one, thus we might choose not to encode it in memory, to gain an extra bit
of precision. This is called the implicit leading bit, and it is the choice made 
in the IEEE 754 formats. For example, the double-precision format has a sign 
bit, an exponent field of 11 bits, and a significand of 53 bits, with only 52 bits 
stored, which gives a total of 64 stored bits: 


    sign    | (biased) exponent | significand
    (1 bit) | (11 bits)         | (52 bits, plus implicit leading bit)

A nice consequence of this particular encoding is the following. Let $x$ be a
double-precision number, neither subnormal, $\pm\infty$, NaN, nor the largest normal
number in absolute value. Consider the 64-bit encoding of $x$ as a 64-bit integer,
with the sign bit in the most significant bit, the exponent bits in the next most
significant bits, and the explicit part of the significand in the low significant
bits. Adding 1 to this 64-bit integer yields the next double-precision number
to $x$, away from zero. Indeed, if the significand $m$ is smaller than $2^{53} - 1$, $m$
becomes $m + 1$, which is smaller than $2^{53}$. If $m = 2^{53} - 1$, then the lowest
52 bits are all set, and a carry occurs between the significand field and the
exponent field. Since the significand field becomes zero, the new significand is
$2^{52}$, taking into account the implicit leading bit. This corresponds to a change
from $(2^{53} - 1) \cdot 2^e$ to $2^{52} \cdot 2^{e+1}$, which is exactly the next number away from
zero. Thanks to this consequence of the encoding, an integer comparison of 
two words (ignoring the actual type of the operands) should give the same 
result as a floating-point comparison, so it is possible to sort normal positive 
floating-point numbers as if they were integers of the same length (64-bit for 
double precision). 
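
This property of the binary64 encoding is easy to try out with Python's standard `struct` module, which reinterprets the eight bytes of a double as a 64-bit unsigned integer. The sketch below assumes that Python's `float` is binary64; the helper names are hypothetical.

```python
import struct

def double_to_int(x):
    """Return the 64-bit encoding of x, read as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def int_to_double(i):
    """Inverse of double_to_int."""
    return struct.unpack(">d", struct.pack(">Q", i))[0]

def next_away_from_zero(x):
    """Next double after x, away from zero (x normal, finite, not the largest)."""
    return int_to_double(double_to_int(x) + 1)

no_carry = next_away_from_zero(1.0)                 # significand field 0 -> 1
with_carry = next_away_from_zero(2.0 - 2.0 ** -52)  # all 52 bits set: carry into exponent
```

The first call only increments the significand field, while the second one exercises the carry into the exponent field described above. Sorting positive normal doubles by `double_to_int` gives the same order as sorting them as floating-point numbers.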

In arbitrary precision, saving one bit is not as crucial as in fixed (small) 
precision, where we are constrained by the word size (usually 32 or 64 bits). 
Thus, in arbitrary precision, it is easier and preferable to encode the whole 
significand. Also, note that having an “implicit bit” is not possible in radix 
$\beta > 2$, since for a normal number the most significant digit might take several
values, from 1 to $\beta - 1$.

When the significand occupies several words, it can be stored in a linked 
list, or in an array (with a separate size field). Lists are easier to extend, but 
accessing arrays is usually more efficient because fewer memory references 
are required in the inner loops and memory locality is better. 

The sign s is most easily encoded as a separate bit field, with a non-negative 
significand. This is the sign-magnitude encoding. Other possibilities are to 
have a signed significand, using either one’s complement or two’s complement, 
but in the latter case a special encoding is required for zero, if it is desired to 
distinguish +0 from —0. Finally, the exponent might be encoded as a signed 
word (for example, type long in the C language). 


3.1.6 Precision: local, global, operation, operand 


The different operands of a given operation might have different precisions, 
and the result of that operation might be desired with yet another precision. 
There are several ways to address this issue.

• The precision, say $n$, is attached to a given operation. In this case, operands
with a smaller precision are automatically converted to precision n. Operands 
with a larger precision might either be left unchanged, or rounded to preci- 
sion n. In the former case, the code implementing the operation must be able 
to handle operands with different precisions. In the latter case, the round- 
ing mode to shorten the operands must be specified. Note that this round- 
ing mode might differ from that of the operation itself, and that operand 
rounding might yield large errors. Consider for example a = 1.345 and 
b = 1.234567 with a precision of four digits. If b is taken as exact, the exact 
value of a — b equals 0.110433, which when rounded to nearest becomes 
0.1104. If b is first rounded to nearest to four digits, we get b’ = 1.235, and 
a — b’ = 0.1100 is rounded to itself. 


• The precision $n$ is attached to each variable. Here again two cases may occur.
  If the operation destination is part of the operation inputs, as in
  sub(c, a, b), which means $c \leftarrow \mathrm{round}(a - b)$, then the precision of
  the result operand $c$ is known, and thus the rounding precision is known
  in advance. Alternatively, if no precision is given for the result, we might
  choose the maximal (or minimal) precision from the input operands, or use
  a global variable, or request an extra precision parameter for the operation,
  as in c = sub(a, b, n).
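
The numerical example from the first case (operands taken as exact versus operands pre-rounded to the operation precision) can be replayed with Python's standard `decimal` module, whose arithmetic treats operands as exact and rounds only the result. This is merely an illustration of the two semantics, not a model of any particular arbitrary-precision library.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

ctx = Context(prec=4, rounding=ROUND_HALF_EVEN)     # operation precision: four digits
a, b = Decimal("1.345"), Decimal("1.234567")

# Operands considered exact: only the result a - b = 0.110433 is rounded.
exact_operands = ctx.subtract(a, b)

# Operand b rounded to four digits first (b' = 1.235), then the subtraction.
b_rounded = ctx.plus(b)
pre_rounded = ctx.subtract(a, b_rounded)
```

The two results are 0.1104 and 0.1100 respectively, which shows how operand rounding can change the low-order digits of the result.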


Of course, these different semantics are inequivalent, and may yield different 
results. In the following, we consider the case where each variable, including 
the destination variable, has its own precision, and no pre-rounding or post- 
rounding occurs. In other words, the operands are considered exact to their full
precision. 

Rounding is considered in detail in §3.1.9. Here we define what we mean by 
the correct rounding of a function. 


Definition 3.1 Let $a, b, \ldots$ be floating-point numbers, $f$ a mathematical
function, $n \ge 1$ an integer, and $\circ$ a rounding mode. We say that $c$ is the
correct rounding of $f(a, b, \ldots)$, and we write $c = \circ_n(f(a, b, \ldots))$, if $c$ is the
floating-point number closest to $f(a, b, \ldots)$ in precision $n$ and according to
the given rounding mode. In case several numbers are at the same distance
from $f(a, b, \ldots)$, the rounding mode must define in a deterministic way which
one is “the closest”. When there is no ambiguity, we omit $n$ and write simply


$c = \circ(f(a, b, \ldots))$.


3.1.7 Link to integers


Most floating-point operations reduce to arithmetic on the significands, which
can be considered as integers as seen at the beginning of this section. 
Therefore, efficient arbitrary precision floating-point arithmetic requires effi- 
cient underlying integer arithmetic (see Chapter 1). 

Conversely, floating-point numbers might be useful for the implementation 
of arbitrary precision integer arithmetic. For example, we might use hard- 
ware floating-point numbers to represent an arbitrary precision integer. Indeed, 
since a double-precision floating-point number has 53 bits of precision, it can 
represent an integer up to $2^{53} - 1$, and an integer $A$ can be represented as
$A = a_{n-1}\beta^{n-1} + \cdots + a_i\beta^i + \cdots + a_1\beta + a_0$, where $\beta = 2^{53}$, and the $a_i$
are stored in double-precision data types. Such an encoding was popular when 
most processors were 32-bit, and some had relatively slow integer operations 
in hardware. Now that most computers are 64-bit, this encoding is obsolete. 

Floating-point expansions are a variant of the above. Instead of storing $a_i$
and having $\beta^i$ implicit, the idea is to directly store $a_i \beta^i$. Of course, this only
works for relatively small $i$, i.e. whenever $a_i \beta^i$ does not exceed the format
range. For example, for IEEE 754 double precision, the maximal integer
precision is 1024 bits. (Alternatively, we might represent an integer as a multiple of
the smallest positive number $2^{-1074}$, with a corresponding maximal precision
of 2098 bits.)

Hardware floating-point numbers might also be used to implement the fast
Fourier transform (FFT), using complex numbers with floating-point real and 
imaginary part (see §3.3.1). 


3.1.8 Ziv’s algorithm and error analysis 


A rounding boundary is a point at which the rounding function $\circ(x)$ is
discontinuous.

In fixed precision, for basic arithmetic operations, it is sometimes possible 
to design one-pass algorithms that directly compute a correct rounding. How- 
ever, in arbitrary precision, or for elementary or special functions, the classical 
method is to use Ziv’s algorithm: 


1. we are given an input $x$, a target precision $n$, and a rounding mode;

2. compute an approximation $y$ with precision $m > n$, and a corresponding
   error bound $\varepsilon$ such that $|y - f(x)| \le \varepsilon$;

3. if $[y - \varepsilon, y + \varepsilon]$ contains a rounding boundary, increase $m$ and go to step 2;

4. output the rounding of $y$, according to the given rounding mode.

The error bound $\varepsilon$ at step 2 might be computed either a priori, i.e. from $x$ and
n only, or dynamically, i.e. from the different intermediate values computed by 
the algorithm. A dynamic bound will usually be tighter, but will require extra 
computations (however, those computations might be done in low precision). 
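
As an illustration, the four steps above can be sketched in Python with the standard `decimal` module for the example function $f(x) = \ln x + \sqrt{x}$ with $x > 1$. Several points are assumptions of this sketch rather than part of the method: the choice of $f$, the function name `ziv_ln_plus_sqrt`, the a priori bound of two units in the last place of $y$ (which relies on `decimal` rounding `ln` and `sqrt` correctly at the working precision), and the growth rule for $m$. The test of step 3 is implemented by rounding both endpoints of the interval and comparing them, which is valid because rounding is monotonic.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

def ziv_ln_plus_sqrt(x, n):
    """Round f(x) = ln(x) + sqrt(x), for a Decimal x > 1, to n digits (to nearest)."""
    target = Context(prec=n, rounding=ROUND_HALF_EVEN)
    m = n + 4                                     # initial working precision, m > n
    while True:
        work = Context(prec=m, rounding=ROUND_HALF_EVEN)
        y = work.add(work.ln(x), work.sqrt(x))    # step 2: approximation of f(x)
        # A priori bound: three roundings of at most 1/2 ulp(y) each, so the
        # total error is below 2 ulp(y), with ulp(y) = 10**(adjusted - m + 1).
        eps = Decimal((0, (2,), y.adjusted() - m + 1))
        wide = Context(prec=m + 4)                # wide enough for exact y +/- eps
        lo = target.plus(wide.subtract(y, eps))
        hi = target.plus(wide.add(y, eps))
        if lo == hi:                              # step 3: no rounding boundary inside
            return lo                             # step 4
        m += m // 2                               # otherwise increase m and retry
```

For example, `ziv_ln_plus_sqrt(Decimal(2), 10)` returns the 10-digit rounding of $\ln 2 + \sqrt{2} = 2.10736074293\ldots$. The loop terminates for this $f$ because its value at a decimal $x > 1$ is never exactly a rounding boundary; for functions with exactly representable results, such cases need separate treatment.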

Depending on the mathematical function to be implemented, we might pre- 
fer an absolute or a relative error analysis. When computing a relative error 
bound, at least two techniques are available: we might express the errors in 
terms of units in the last place (ulps), or we might express them in terms of 
true relative error. It is of course possible in a given analysis to mix both kinds 
of errors, but in general a constant factor (the radix $\beta$) is lost when converting
from one kind of relative error to the other kind. 

Another important distinction is forward versus backward error analysis. 
Assume we want to compute y = f(x). Because the input is rounded, and/or 
because of rounding errors during the computation, we might actually compute 
$y' = f(x')$. Forward error analysis will bound $|y' - y|$ if we have a bound on
$|x' - x|$ and on the rounding errors that occur during the computation.

Backward error analysis works in the other direction. If the computed value
is $y'$, then backward error analysis will give us a number $\delta$ such that, for some
$x'$ in the ball $|x' - x| \le \delta$, we have $y' = f(x')$. This means that the error is
no worse than might have been caused by an error of $\delta$ in the input value. Note
that, if the problem is ill-conditioned, $\delta$ might be small even if $|y' - y|$ is large.

In our error analyses, we assume that no overflow or underflow occurs, 
or equivalently that the exponent range is unbounded, unless the contrary is 
explicitly stated. 


3.1.9 Rounding 


There are several possible definitions of rounding. For example probabilistic
rounding (also called stochastic rounding) chooses at random a rounding
towards $+\infty$ or $-\infty$ for each operation. The IEEE 754 standard defines four
rounding modes: towards zero, $+\infty$, $-\infty$ and to nearest (with ties broken to
even). Another useful mode is “rounding away from zero”, which rounds in the
opposite direction from zero: a positive number is rounded towards $+\infty$, and a
negative number towards $-\infty$. If the sign of the result is known, all IEEE 754
rounding modes might be converted to either rounding to nearest, rounding 
towards zero, or rounding away from zero. 


Theorem 3.2 Consider a floating-point system with radix $\beta$ and precision $n$.
Let $u$ be the rounding to nearest of some real $x$. Then the following inequalities
hold: $|u - x| \le \frac{1}{2}\,\mathrm{ulp}(u)$, $|u - x| \le \frac{1}{2}\beta^{1-n}|u|$,
$|u - x| \le \frac{1}{2}\beta^{1-n}|x|$.

Proof. For $x = 0$, necessarily $u = 0$, and the statement holds. Without loss of
generality, we can assume $u$ and $x$ positive. The first inequality is the definition
of rounding to nearest, and the second one follows from $\mathrm{ulp}(u) \le \beta^{1-n}u$.
(In the case $\beta = 2$, it gives $|u - x| \le 2^{-n}|u|$.) For the last inequality, we
distinguish two cases: if $u \le x$, it follows from the second inequality. If $x < u$,
then if $x$ and $u$ have the same exponent, i.e. $\beta^{e-1} \le x < u < \beta^e$, then
$\mathrm{ulp}(u) = \beta^{e-n} \le \beta^{1-n}x$. The remaining case is $\beta^{e-1} \le x < u = \beta^e$. Since
the floating-point number preceding $\beta^e$ is $\beta^e(1 - \beta^{-n})$, and $x$ was rounded to
nearest, we have $|u - x| \le \beta^{e-n}/2$ here too.


In order to round according to a given rounding mode, we proceed as fol- 
lows: 


1. first round as if the exponent range was unbounded, with the given rounding
   mode;
2. if the rounded result is within the exponent range, return this result;
3. otherwise raise the “underflow” or “overflow” exception, and return $\pm 0$ or
   $\pm\infty$ accordingly.


For example, assume radix 10 with precision 4, $e_{\max} = 3$, with $x = 0.9234 \cdot 10^3$,
$y = 0.7656 \cdot 10^2$. The exact sum $x + y$ equals $0.99996 \cdot 10^3$. With rounding
towards zero, we obtain $0.9999 \cdot 10^3$, which is representable, so there is no
overflow. With rounding to nearest, $x + y$ rounds to $0.1000 \cdot 10^4$, where the
exponent 4 exceeds $e_{\max} = 3$, so we get $+\infty$ as the result, with an overflow.
In this model, overflow depends not only on the operands, but also on the 
rounding mode. 
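
The example above can be reproduced with Python's decimal module. This is only a sketch under stated assumptions: decimal normalizes significands as d.ddd rather than 0.dddd, so the bound $e_{\max} = 3$ of the text corresponds to Emax = 2 here, and the Overflow trap is disabled so that an infinity is returned instead of an exception being raised. The helper name add_with is ours.

```python
from decimal import Decimal, Context, Overflow, ROUND_DOWN, ROUND_HALF_EVEN

x = Decimal("923.4")   # 0.9234 * 10^3
y = Decimal("76.56")   # 0.7656 * 10^2

def add_with(rounding):
    # Precision 4; Emax=2 plays the role of e_max = 3 in the 0.dddd * 10^e convention.
    ctx = Context(prec=4, rounding=rounding, Emin=-3, Emax=2, traps=[])
    result = ctx.add(x, y)
    return result, bool(ctx.flags[Overflow])

toward_zero = add_with(ROUND_DOWN)        # rounds to 999.9, no overflow
to_nearest = add_with(ROUND_HALF_EVEN)    # rounds to 1000, outside the exponent range
print(toward_zero)
print(to_nearest)
```

The same operands overflow under one rounding mode and not under the other, as described in the text.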

The “round to nearest” mode of IEEE 754 rounds the result of an operation
to the nearest representable number. In case the result of an operation is exactly
halfway between two consecutive numbers, the one with least significant bit
zero is chosen (for radix 2). For example, $1.1011_2$ is rounded with a precision
of four bits to $1.110_2$, as is $1.1101_2$. However, this rule does not readily extend
to an arbitrary radix. Consider for example radix $\beta = 3$, a precision of four
digits, and the number $1212.111\ldots_3$. Both $1212_3$ and $1220_3$ end in an even
digit. The natural extension is to require the whole significand to be even, when
interpreted as an integer in $[\beta^{n-1}, \beta^n - 1]$. In this setting, $(1212.111\ldots)_3$
rounds to $(1212)_3 = 50_{10}$. (Note that $\beta^n$ is an odd number here.)

Assume we want to correctly round a real number, whose binary expansion
is $2^e \cdot 0.1b_2 \ldots b_n b_{n+1} \ldots$, to $n$ bits. It is enough to know the values of
$r = b_{n+1}$ (called the round bit) and that of the sticky bit $s$, which is zero when
$b_{n+2}b_{n+3}\ldots$ is identically zero, and one otherwise. Table 3.1 shows how to
correctly round given $r$, $s$, and the given rounding mode; rounding to $\pm\infty$
being converted to rounding towards zero or away from zero, according to the
sign of the number. The entry “$b_n$” is for round to nearest in the case of a tie:
if $b_n = 0$, it will be unchanged, but if $b_n = 1$, we add one (thus changing $b_n$
to zero).


 r   s    towards zero    to nearest    away from zero

 0   0         0              0               0
 0   1         0              0               1
 1   0         0            $b_n$             1
 1   1         0              1               1


Table 3.1 Rounding rules according to the round bit $r$ and
the sticky bit $s$: a “0” entry means truncate (round towards
zero), a “1” means round away from zero (add one to the
truncated significand).
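
The rules of Table 3.1 can be sketched in a few lines of Python. The function below is an illustration, not part of the original text: it takes a positive integer whose binary digits play the role of $b_1 b_2 \ldots$, and the mode names 'zero', 'nearest' and 'away' are our own labels.

```python
def round_to_n_bits(v, n, mode):
    """Round the positive integer v to n significant bits using Table 3.1.

    Returns (significand, shift) such that the rounded value is
    significand * 2**shift. mode is 'zero', 'nearest' or 'away'.
    """
    shift = v.bit_length() - n
    if shift <= 0:
        return v, 0                                  # already fits in n bits
    truncated = v >> shift                           # leading bits b_1 ... b_n
    r = (v >> (shift - 1)) & 1                       # round bit b_{n+1}
    s = int(v & ((1 << (shift - 1)) - 1) != 0)       # sticky bit
    if mode == "zero":
        inc = 0
    elif mode == "away":
        inc = r | s
    elif mode == "nearest":
        # In a tie (r = 1, s = 0) the table entry is b_n; otherwise it is r.
        inc = (truncated & 1) if (r, s) == (1, 0) else r
    else:
        raise ValueError(mode)
    rounded = truncated + inc
    if rounded.bit_length() > n:                     # carry out: 1111 + 1 = 10000
        rounded >>= 1
        shift += 1
    return rounded, shift

# The two examples from the text: 1.1011 and 1.1101 both round to 1.110.
print(round_to_n_bits(0b11011, 4, "nearest"), round_to_n_bits(0b11101, 4, "nearest"))
```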


In general, we do not have an infinite expansion, but a finite approximation $y$
of an unknown real value $x$. For example, $y$ might be the result of an arithmetic
operation such as division, or an approximation to the value of a transcendental
function such as exp. The following problem arises: given the approximation
$y$, and a bound on the error $|y - x|$, is it possible to determine the correct
rounding of $x$? Algorithm RoundingPossible returns true iff it is possible.


Algorithm 3.1 RoundingPossible
Input: a floating-point number $y = 0.1y_2 \ldots y_m$, a precision $n \le m$, an error
       bound $\varepsilon = 2^{-k}$, a rounding mode $\circ$
Output: true when $\circ_n(x)$ can be determined for $|y - x| \le \varepsilon$
    if $k \le n + 1$ then return false
    if $\circ$ is to nearest then $r \leftarrow 1$ else $r \leftarrow 0$
    if $y_{n+1} = r$ and $y_{n+2} = \cdots = y_k = 0$ then $s \leftarrow 0$ else $s \leftarrow 1$
    if $s = 1$ then return true else return false.


Proof of correctness. Since rounding is monotonic, it is possible to determine
$\circ(x)$ exactly when $\circ(y - 2^{-k}) = \circ(y + 2^{-k})$, or in other words when the
interval $[y - 2^{-k}, y + 2^{-k}]$ contains no rounding boundary (or only one as
$y - 2^{-k}$ or $y + 2^{-k}$).

If $k \le n + 1$, then the interval $[y - 2^{-k}, y + 2^{-k}]$ has width at least $2^{-n}$, and
thus contains at least one rounding boundary in its interior, or two rounding
boundaries, and it is not possible to round correctly. In the case of
directed rounding (resp. rounding to nearest), if $s = 0$, the approximation $y$ is
representable (resp. the middle of two representable numbers) in precision
$n$, and it is clearly not possible to round correctly. If $s = 1$, the interval
$[y - 2^{-k}, y + 2^{-k}]$ contains at most one rounding boundary, and, if so, it is one
of the bounds; thus, it is possible to round correctly.
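
A literal transcription of Algorithm RoundingPossible into Python might look as follows. It is a sketch under our own conventions: the approximation is passed as an integer Y with exactly m bits, so that $y = Y/2^m$, bits $y_i$ with $i > m$ are taken to be zero, and the rounding mode is reduced to a boolean flag.

```python
def rounding_possible(Y, m, n, k, to_nearest):
    """Transcription of Algorithm RoundingPossible.

    y = Y / 2**m with 2**(m-1) <= Y < 2**m, i.e. y = 0.1 y_2 ... y_m in binary;
    n <= m is the target precision and 2**(-k) the error bound.
    """
    def bit(i):
        # y_i for i >= 1 (most significant first); zero beyond the m-th bit.
        return (Y >> (m - i)) & 1 if i <= m else 0

    if k <= n + 1:
        return False
    r = 1 if to_nearest else 0
    if bit(n + 1) == r and all(bit(i) == 0 for i in range(n + 2, k + 1)):
        s = 0
    else:
        s = 1
    return s == 1

# y = 0.10110001 with error bound 2**-7: the bits y_5..y_7 are all zero, so a
# directed rounding to 4 bits cannot be decided.
print(rounding_possible(0b10110001, 8, 4, 7, to_nearest=False))
```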


The double rounding problem 
When a given real value $x$ is first rounded to precision $m$ and then to precision
$n < m$, we say that a “double rounding” occurs. The “double rounding problem”
happens when this latter value differs from the direct rounding of $x$ to the
smaller precision $n$, assuming the same rounding mode is used in all cases, i.e.
when


    $\circ_n(\circ_m(x)) \ne \circ_n(x)$.


The double rounding problem does not occur for directed rounding modes.
For these rounding modes, the rounding boundaries at the larger precision $m$
refine those at the smaller precision $n$, thus all real values $x$ that round to the
same value $y$ at precision $m$ also round to the same value at precision $n$, namely
$\circ_n(y)$.

Consider the decimal value $x = 3.14251$. Rounding to nearest to five digits,
we get $y = 3.1425$; rounding $y$ to nearest-even to four digits, we get $3.142$,
whereas direct rounding of $x$ would give $3.143$.
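
This decimal example can be checked with Python's decimal module. The helper round_to is our own name; Context.plus is used because it applies the context's precision and rounding mode to its operand.

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

def round_to(value, digits):
    # Round value to the given number of significant digits, ties to even.
    return Context(prec=digits, rounding=ROUND_HALF_EVEN).plus(value)

x = Decimal("3.14251")
twice = round_to(round_to(x, 5), 4)   # first to five digits, then to four
once = round_to(x, 4)                 # directly to four digits
print(twice, once)
```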

With rounding to nearest mode, the double rounding problem only occurs
when the second rounding involves the even-rule, i.e. the value $y = \circ_m(x)$ is
a rounding boundary at precision $n$. Otherwise, $y$ has distance at least one ulp
(in precision $m$) from a rounding boundary at precision $n$, and since $|y - x|$ is
bounded by half an ulp (in precision $m$), all possible values for $x$ round to the
same value in precision $n$.

Note that the double rounding problem does not occur with all ways of 
breaking ties for rounding to nearest (Exercise 3.2). 


3.1.10 Strategies 


To determine the correct rounding of $f(x)$ with $n$ bits of precision, the best
strategy is usually to first compute an approximation $y$ to $f(x)$ with a working
precision of $m = n + h$ with $h$ relatively small. Several strategies are possible
in Ziv’s algorithm (§3.1.8) when this first approximation $y$ is not accurate
enough, or too close to a rounding boundary:


• Compute the exact value of $f(x)$, and round it to the target precision $n$.
  This is possible for a basic operation, for example $f(x) = x^2$, or more
  generally $f(x, y) = x + y$ or $x \times y$. Some elementary functions may yield
  an exactly representable output too, for example $\sqrt{2.25} = 1.5$. An “exact
  result” test after the first approximation avoids possibly unnecessary further
  computations.


• Repeat the computation with a larger working precision $m' = n + h'$.
  Assuming that the digits of $f(x)$ behave “randomly” and that $|f'(x)/f(x)|$ is
  not too large, using $h' \approx \lg n$ is enough to guarantee that rounding is
  possible with probability $1 - O(1/n)$. If rounding is still not possible, because
  the $h'$ last digits of the approximation encode $0$ or $2^{h'} - 1$, we can increase
  the working precision and try again. A check for exact results guarantees that
  this process will eventually terminate, provided the algorithm used has the
  property that it gives the exact result if this result is representable and the
  working precision is high enough. For example, the square root algorithm
  should return the exact result if it is representable (see Algorithm FPSqrt in
  §3.5, and also Exercise 3.3).


3.2 Addition, subtraction, comparison 


Addition and subtraction of floating-point numbers operate from the most sig- 
nificant digits, whereas integer addition and subtraction start from the least 
significant digits. Thus completely different algorithms are involved. Also, in 
the floating-point case, part or all of the inputs might have no impact on the 
output, except in the rounding phase. 

In summary, floating-point addition and subtraction are more difficult to im- 
plement than integer addition/subtraction for two reasons: 


• Scaling due to the exponents requires shifting the significands before adding
  or subtracting them. In principle, we could perform all operations using
  only integer operations, but this might require huge integers, for example
  when adding $1$ and $2^{-1000}$.

• As the carries are propagated from least to most significant digits, we may
  have to look at arbitrarily low input digits to guarantee correct rounding.


In this section, we distinguish between “addition”, where both operands to
be added have the same sign, and “subtraction”, where the operands to be
added have different signs (we assume a sign-magnitude representation). The
case of one or both operands zero is treated separately; in the description below
we assume that all operands are non-zero.


3.2.1 Floating-point addition


Algorithm FPadd adds two binary floating-point numbers $b$ and $c$ of the same
sign. More precisely, it computes the correct rounding of $b + c$, with respect
to the given rounding mode $\circ$. For the sake of simplicity, we assume $b$ and $c$
are positive, $b \ge c > 0$. It will also be convenient to scale $b$ and $c$ so that
$2^{n-1} \le b < 2^n$ and $2^{m-1} \le c < 2^m$, where $n$ is the desired precision of the
output, and $m \le n$. Of course, if the inputs $b$ and $c$ to Algorithm FPadd are
scaled by $2^k$, then, to compensate for this, the output must be scaled by $2^{-k}$.
We assume that the rounding mode is to nearest, towards zero, or away from
zero (rounding to $\pm\infty$ reduces to rounding towards zero or away from zero,
depending on the sign of the operands).


Algorithm 3.2 FPadd
Input: $b \ge c > 0$ two binary floating-point numbers, a precision $n$ such that
       $2^{n-1} \le b < 2^n$, and a rounding mode $\circ$
Output: a floating-point number $a$ of precision $n$ and scale $e$ such that
        $a \cdot 2^e = \circ(b + c)$
1: split $b$ into $b_h + b_\ell$ where $b_h$ contains the $n$ most significant bits of $b$.
2: split $c$ into $c_h + c_\ell$ where $c_h$ contains the most significant bits of $c$, and
   $\mathrm{ulp}(c_h) = \mathrm{ulp}(b_h) = 1$                    ▷ $c_h$ might be zero
3: $a_h \leftarrow b_h + c_h$, $e \leftarrow 0$
4: $(c, r, s) \leftarrow b_\ell + c_\ell$                          ▷ see the text
5: $(a, t) \leftarrow (a_h + c + \mathrm{round}(\circ, r, s), \mathrm{etc.})$    ▷ for $t$ see Table 3.2 (upper)
6: if $a \ge 2^n$ then
7:     $(a, e) \leftarrow (\mathrm{round2}(\circ, a, t), e + 1)$        ▷ see Table 3.2 (lower)
8:     if $a = 2^n$ then $(a, e) \leftarrow (a/2, e + 1)$
9: return $(a, e)$.


The values of round($\circ$, $r$, $s$) and round2($\circ$, $a$, $t$) are given in Table 3.2. We
have simplified some of the expressions given in Table 3.2. For example, in
the upper half of the table, $r \vee s$ means 0 if $r = s = 0$, and 1 otherwise. In
the lower half of the table, $2\lfloor (a + 1)/4 \rfloor$ is $(a - 1)/2$ if $a = 1 \bmod 4$, and
$(a + 1)/2$ if $a = 3 \bmod 4$.

At step 4 of Algorithm FPadd, the notation $(c, r, s) \leftarrow b_\ell + c_\ell$ means that $c$ is
the carry bit of $b_\ell + c_\ell$, $r$ the round bit, and $s$ the sticky bit; $c, r, s \in \{0, 1\}$. For
rounding to nearest, $t = \mathrm{sign}(b + c - a)$ is a ternary value, which is respectively
positive, zero, or negative when $a$ is smaller than, equal to, or larger than the
exact sum $b + c$.


Theorem 3.3 Algorithm FPadd is correct.


 $\circ$             r      s       round($\circ$, r, s)        t
 towards 0     any    any     0                      -
 away from 0   any    any     $r \vee s$                -
 to nearest    0      any     0                      $s$
 to nearest    1      0       0/1 (even rounding)    +1/-1
 to nearest    1      $\ne 0$    1                      -1

 $\circ$             a mod 2    t       round2($\circ$, a, t)
 any           0          any     $a/2$
 towards 0     1          any     $(a - 1)/2$
 away from 0   1          any     $(a + 1)/2$
 to nearest    1          0       $2\lfloor (a + 1)/4 \rfloor$
 to nearest    1          $\pm 1$    $(a + t)/2$


Table 3.2 Rounding rules for addition.


Proof. We have $2^{n-1} \le b < 2^n$ and $2^{m-1} \le c < 2^m$, with $m \le n$. Thus,
$b_h$ and $c_h$ are the integer parts of $b$ and $c$, $b_\ell$ and $c_\ell$ their fractional parts. Since
$b \ge c$, we have $c_h \le b_h$ and $2^{n-1} \le b_h \le 2^n - 1$; thus, $2^{n-1} \le a_h \le 2^{n+1} - 2$,
and, at step 5, $2^{n-1} \le a \le 2^{n+1}$. If $a < 2^n$, $a$ is the correct rounding of $b + c$.
Otherwise, we face the “double rounding” problem: rounding $a$ down to $n$ bits
will give the correct result, except when $a$ is odd and rounding is to nearest. In
that case, we need to know if the first rounding was exact, and if not in which
direction it was rounded; this information is encoded in the ternary value $t$.
After the second rounding, we have $2^{n-1} \le a \le 2^n$.


Note that the exponent $e_a$ of the result lies between $e_b$ (the exponent of $b$;
here we considered the case $e_b = n$) and $e_b + 2$. Thus, no underflow can occur
in an addition. The case $e_a = e_b + 2$ can occur only when the destination
precision is less than that of the operands.
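
To make the role of the ternary value $t$ concrete, here is a hedged Python sketch of Algorithm FPadd on exact binary rationals, together with a one-step reference rounding used only for comparison. Representing the operands as fractions.Fraction values, the mode names 'zero', 'away' and 'nearest', and the helper exact_round are all our own choices, not part of the original algorithm.

```python
from fractions import Fraction
from math import floor

def fp_add(b, c, n, mode):
    """Sketch of FPadd for exact binary rationals b >= c > 0 scaled so that
    2**(n-1) <= b < 2**n. Returns (a, e) with a * 2**e the rounded sum."""
    bh, ch = floor(b), floor(c)                  # steps 1-2: ulp(bh) = ulp(ch) = 1
    ah, e = bh + ch, 0                           # step 3
    low = (b - bh) + (c - ch)                    # step 4: carry, round, sticky bits
    carry = floor(low)
    r = floor(2 * (low - carry))
    s = int(2 * (low - carry) != r)
    if mode == "zero":                           # step 5, Table 3.2 (upper)
        inc, t = 0, None
    elif mode == "away":
        inc, t = r | s, None
    elif r == 0:
        inc, t = 0, s
    elif s == 0:                                 # tie: round to even
        inc = (ah + carry) & 1
        t = -1 if inc else 1
    else:
        inc, t = 1, -1
    a = ah + carry + inc
    if a >= 2**n:                                # steps 6-8, Table 3.2 (lower)
        if a % 2 == 0:
            a = a // 2
        elif mode == "zero":
            a = (a - 1) // 2
        elif mode == "away":
            a = (a + 1) // 2
        elif t == 0:
            a = 2 * ((a + 1) // 4)
        else:
            a = (a + t) // 2
        e += 1
        if a == 2**n:
            a, e = a // 2, e + 1
    return a, e

def exact_round(v, n, mode):
    """Reference: round the exact value v >= 2**(n-1) to n bits in one step."""
    e = 0
    while v / 2**e >= 2**n:
        e += 1
    q = v / 2**e
    a = floor(q)
    frac = q - a
    if mode == "away":
        a += int(frac > 0)
    elif mode == "nearest":
        a += int(frac > Fraction(1, 2) or (frac == Fraction(1, 2) and a % 2 == 1))
    if a == 2**n:
        a, e = a // 2, e + 1
    return a, e

# 9.25 + 9.5 = 18.75 with n = 4: a naive second rounding of 19 would give 20.
print(fp_add(Fraction(37, 4), Fraction(19, 2), 4, "nearest"))
```

In the printed example the first rounding yields the odd value 19 with $t = -1$, so the second rounding goes down to 9 with scale 1, matching the direct rounding of 18.75.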


3.2.2 Floating-point subtraction 


Floating-point subtraction (of positive operands) is very similar to addition,
with the difference that cancellation can occur. Consider for example the
subtraction $6.77823 - 5.98771$. The most significant digit of both operands
disappeared in the result $0.79052$. This cancellation can be dramatic, as in
$6.7782357934 - 6.7782298731 = 0.0000059203$, where six digits were
cancelled.

Two approaches are possible, assuming $n$ result digits are wanted, and the
exponent difference between the inputs is $d$:


• Subtract the $n - d$ most-significant digits of the smaller operand from the
  $n$ most-significant digits of the larger operand. If the result has $n - e$ digits
  with $e > 0$, restart with $n + e$ digits from the larger operand and $(n + e) - d$
  from the smaller operand.

• Alternatively, predict the number $e$ of cancelled digits in the subtraction,
  and directly subtract the $(n + e) - d$ most-significant digits of the smaller
  operand from the $n + e$ most-significant digits of the larger one.


Note that, in the first approach, we might have e = n if all most-significant 
digits cancel, and thus the process might need to be repeated several times. 

The first step in the second approach is usually called leading zero detection.
Note that the number $e$ of cancelled digits might depend on the rounding
mode. For example, $6.778 - 5.7781$ with a 3-digit result yields $0.999$ with
rounding toward zero, and $1.00$ with rounding to nearest. Therefore, in a real
implementation, the definition of $e$ has to be made precise.
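
The dependence on the rounding mode in this example can be observed directly with Python's decimal module (a small illustration, with variable names of our choosing):

```python
from decimal import Decimal, Context, ROUND_DOWN, ROUND_HALF_EVEN

x, y = Decimal("6.778"), Decimal("5.7781")
# The exact difference is 0.9999; it is rounded to 3 significant digits.
toward_zero = Context(prec=3, rounding=ROUND_DOWN).subtract(x, y)
to_nearest = Context(prec=3, rounding=ROUND_HALF_EVEN).subtract(x, y)
print(toward_zero, to_nearest)
```

With rounding toward zero one leading digit is cancelled, whereas with rounding to nearest the result keeps the exponent of the operands.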

In practice, we might consider n + g and (n + g) — d digits instead of n 
and n — d, where the g “guard digits” would prove useful (i) to decide the final 
rounding, and/or (ii) to avoid another loop in case e < g. 


Sterbenz’s theorem 
Sterbenz’s theorem is an important result concerning floating-point subtraction 
(of operands of the same sign). It states that the rounding error is zero in some 
common cases. More precisely: 


Theorem 3.4 (Sterbenz) If $x$ and $y$ are two floating-point numbers of same
precision $n$, such that $y$ lies in the interval $[x/2, 2x] \cup [2x, x/2]$, then $y - x$ is
exactly representable in precision $n$, if there is no underflow.


Proof. The case $x = y = 0$ is trivial, so assume that $x \ne 0$. Since $y \in
[x/2, 2x] \cup [2x, x/2]$, $x$ and $y$ must have the same sign. We assume without
loss of generality that $x$ and $y$ are positive, so $y \in [x/2, 2x]$.

Assume $x \le y \le 2x$ (the same reasoning applies for $x/2 \le y \le x$, i.e. $y \le
x \le 2y$, by interchanging $x$ and $y$). Since $x \le y$, we have $\mathrm{ulp}(x) \le \mathrm{ulp}(y)$,
and thus $y$ is an integer multiple of $\mathrm{ulp}(x)$. It follows that $y - x$ is an integer
multiple of $\mathrm{ulp}(x)$. Since $0 \le y - x \le x$, $y - x$ is necessarily representable
with the precision of $x$.
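
Sterbenz's theorem is easy to test empirically on IEEE 754 double-precision numbers. The following sketch (helper names are ours) uses the fact that converting a Python float to a Fraction is exact, so the floating-point difference can be compared with the true difference. The operands are kept well away from the underflow range.

```python
import random
from fractions import Fraction

def subtraction_is_exact(x, y):
    # Fraction(float) is exact, so this compares y - x as computed in
    # floating-point with the true rational difference.
    return Fraction(y - x) == Fraction(y) - Fraction(x)

rng = random.Random(0)
all_exact = True
for _ in range(10000):
    x = rng.uniform(1.0, 1000.0)
    # Clamp so that y is guaranteed to lie in [x/2, 2x].
    y = min(max(rng.uniform(x / 2, 2 * x), x / 2), 2 * x)
    all_exact = all_exact and subtraction_is_exact(x, y)
print(all_exact)

# Outside the interval the difference is generally rounded, for example:
print(subtraction_is_exact(2.0**-60, 1.0))
```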


It is important to note that Sterbenz’s theorem applies for any radix $\beta$; the
constant 2 in $[x/2, 2x]$ has nothing to do with the radix.


3.3 Multiplication


Multiplication of floating-point numbers is called a short product. This reflects
the fact that, in some cases, the low part of the full product of the significands
has no impact (except perhaps for the rounding) on the final result.
Consider the multiplication $x \times y$, where $x = \ell\beta^e$ and $y = m\beta^f$. Then
$\circ(xy) = \circ(\ell m)\beta^{e+f}$, and it suffices to consider the case that $x = \ell$ and $y = m$
are integers, and the product is rounded at some weight $\beta^g$ for $g \ge 0$. Either
the integer product $\ell \times m$ is computed exactly, using one of the algorithms
from Chapter 1, and then rounded; or the upper part is computed directly using
a “short product algorithm”, with correct rounding. The different cases that can
occur are depicted in Figure 3.1.


[Figure 3.1: four panels (a), (b), (c), (d); image not reproduced]


Figure 3.1 Different multiplication scenarios, according to the input and output 
precisions. The rectangle corresponds to the full product of the inputs x and y 
(most significant digits bottom left), the triangle to the wanted short product. 
Case (a): no rounding is necessary, the product being exact; case (b): the full 
product needs to be rounded, but the inputs should not be; case (c): the input x 
with the larger precision might be truncated before performing a short product; 
case (d): both inputs might be truncated. 


An interesting question is: how many consecutive identical bits can occur
after the round bit? Without loss of generality, we can rephrase this question
as follows. Given two odd integers of at most $n$ bits, what is the longest run
of identical bits in their product? (In the case of an even significand, we might
write it $m = \ell 2^e$ with $\ell$ odd.) There is no a priori bound except the trivial one
of $2n - 2$ for the number of zeros, and $2n - 1$ for the number of ones. For
example, with a precision 5 bits, $27 \times 19 = (1\,000\,000\,001)_2$. More generally,
such a case corresponds to a factorization of $2^{2n-1} + 1$ into two integers of $n$
bits, for example $258\,513 \times 132\,913 = 2^{35} + 1$. Having $2n$ consecutive ones
is not possible since $2^{2n} - 1$ can not factor into two integers of at most $n$ bits.
Therefore, the maximal runs have $2n - 1$ ones, for example $217 \times 151 =
(111\,111\,111\,111\,111)_2$ for $n = 8$. A larger example is $849\,583 \times 647\,089 =
2^{39} - 1$.
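
The four numerical examples above are quick to confirm with exact integer arithmetic; the snippet below simply multiplies the factors and prints the binary expansion of each product.

```python
examples = [
    (27, 19, 2**9 + 1),            # n = 5: a run of 2n - 2 = 8 zeros
    (258513, 132913, 2**35 + 1),   # n = 18
    (217, 151, 2**15 - 1),         # n = 8: a run of 2n - 1 = 15 ones
    (849583, 647089, 2**39 - 1),   # n = 20
]
for p, q, expected in examples:
    print(f"{p} x {q} = {bin(p * q)}", p * q == expected)
```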

The exact product of two floating-point numbers $m\beta^e$ and $m'\beta^{e'}$ is
$(mm')\beta^{e+e'}$. Therefore, if no underflow or overflow occurs, the problem
reduces to the multiplication of the significands $m$ and $m'$. See Algorithm
FPmultiply.

The product at step 1 of FPmultiply is a short product, i.e. a product whose 
most significant part only is wanted, as discussed at the start of this section. In 
the quadratic range, it can be computed in about half the time of a full product. 
In the Karatsuba and Toom—Cook ranges, Mulders’ algorithm can gain 10% to 
20%; however, due to carries, implementing this algorithm for floating-point 
computations is tricky. In the FFT range, no better algorithm is known than 
computing the full product mm’ and then rounding it. 


Algorithm 3.3 FPmultiply
Input: $x = m \cdot \beta^e$, $x' = m' \cdot \beta^{e'}$, a precision $n$, a rounding mode $\circ$
Output: $\circ(xx')$ rounded to precision $n$
1: $m'' \leftarrow \circ(mm')$ rounded to precision $n$
2: return $m'' \cdot \beta^{e+e'}$.


Hence, our advice is to perform a full product of $m$ and $m'$, possibly after
truncating them to $n + g$ digits if they have more than $n$ digits. Here $g$ (the
number of guard digits) should be positive (see Exercise 3.4).

It seems wasteful to multiply $n$-bit operands, producing a $2n$-bit product,
only to discard the low-order $n$ bits. Algorithm ShortProduct computes an
approximation to the short product without computing the $2n$-bit full product.
It uses a threshold $n_0 \ge 1$, which should be optimized for the given code base.


Error analysis of the short product. Consider two $n$-word normalized
significands $A$ and $B$ that we multiply using a short product algorithm, where the
notation FullProduct($A$, $B$) means the full integer product $A \cdot B$.


Algorithm 3.4 ShortProduct
Input: integers $A$, $B$, and $n$, with $0 \le A, B < \beta^n$
Output: an approximation to $AB$ div $\beta^n$
Require: a threshold $n_0$
    if $n \le n_0$ then return FullProduct($A$, $B$) div $\beta^n$
    choose $k \ge n/2$, $\ell \leftarrow n - k$
    $C_1 \leftarrow$ FullProduct($A$ div $\beta^\ell$, $B$ div $\beta^\ell$) div $\beta^{k-\ell}$
    $C_2 \leftarrow$ ShortProduct($A$ mod $\beta^\ell$, $B$ div $\beta^k$, $\ell$)
    $C_3 \leftarrow$ ShortProduct($A$ div $\beta^k$, $B$ mod $\beta^\ell$, $\ell$)
    return $C_1 + C_2 + C_3$.


Figure 3.2 Graphical view of Algorithm ShortProduct:
the computed parts are $C_1$, $C_2$, $C_3$, and the neglected
parts are $C'_2$, $C'_3$, $C'_4$ (most significant part bottom left).


Theorem 3.5 The value $C'$ returned by Algorithm ShortProduct differs from
the exact short product $C = AB$ div $\beta^n$ by at most $3(n - 1)$:


    $C' \le C \le C' + 3(n - 1)$.


Proof. First, since $A$, $B$ are non-negative, and all roundings are truncations,
the inequality $C' \le C$ follows.

Let $A = \sum_i a_i\beta^i$ and $B = \sum_j b_j\beta^j$, where $0 \le a_i, b_j < \beta$. The possible
errors come from: (i) the neglected $a_i b_j$ terms, i.e. parts $C'_2$, $C'_3$, $C'_4$ of
Figure 3.2; (ii) the truncation while computing $C_1$; (iii) the error in the
recursive calls for $C_2$ and $C_3$.

We first prove that the algorithm accumulates all products $a_i b_j$ with $i + j \ge
n - 1$. This corresponds to all terms on and below the diagonal in
Figure 3.2. The most significant neglected terms are the bottom-left terms from
$C'_2$ and $C'_3$, respectively $a_{\ell-1}b_{k-1}$ and $a_{k-1}b_{\ell-1}$. Their contribution is at most
$2(\beta - 1)^2\beta^{n-2}$. The neglected terms from the next diagonal contribute at most
$4(\beta - 1)^2\beta^{n-3}$, and so on. The total contribution of neglected terms is thus
bounded by


    $(\beta - 1)^2\beta^n\,[2\beta^{-2} + 4\beta^{-3} + 6\beta^{-4} + \cdots] < 2\beta^n$


(the inequality is strict since the sum is finite).
The truncation error in $C_1$ is at most $\beta^n$, thus the maximal difference $\varepsilon(n)$
between $C$ and $C'$ satisfies


    $\varepsilon(n) < 3 + 2\varepsilon(\lfloor n/2 \rfloor)$,


which gives $\varepsilon(n) < 3(n - 1)$, since $\varepsilon(1) = 0$.
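
As a sanity check of Algorithm ShortProduct and of the bound of Theorem 3.5, the recursion can be written directly with Python integers. This is a sketch: the choice $k = \lceil n/2 \rceil$, the default threshold n0 = 1, and the test harness max_error are our own assumptions, and no attempt is made at efficiency (each FullProduct is an ordinary integer product).

```python
import random

def short_product(A, B, n, beta, n0=1):
    """Approximate (A * B) // beta**n for 0 <= A, B < beta**n."""
    if n <= n0:
        return (A * B) // beta**n
    k = (n + 1) // 2              # one admissible choice with k >= n/2
    l = n - k
    C1 = ((A // beta**l) * (B // beta**l)) // beta**(k - l)
    C2 = short_product(A % beta**l, B // beta**k, l, beta, n0)
    C3 = short_product(A // beta**k, B % beta**l, l, beta, n0)
    return C1 + C2 + C3

def max_error(n, beta=2**16, trials=200, seed=1):
    """Largest observed C - C' over random n-word operands."""
    rng = random.Random(seed)
    worst = 0
    for _ in range(trials):
        A, B = rng.randrange(beta**n), rng.randrange(beta**n)
        exact = (A * B) // beta**n
        worst = max(worst, exact - short_product(A, B, n, beta))
    return worst

for n in (2, 4, 8, 16):
    print(n, max_error(n), "bound:", 3 * (n - 1))
```

In such experiments the observed error is typically far below the worst-case bound $3(n - 1)$.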


REMARK: if one of the operands was truncated before applying Algorithm
ShortProduct, simply add one unit to the upper bound (the truncated part is
less than 1, and thus its product by the other operand is bounded by $\beta^n$).

The complexity $S(n)$ of Algorithm ShortProduct satisfies the recurrence
$S(n) = M(k) + 2S(n - k)$. The optimal choice of $k$ depends on the underlying
multiplication algorithm. Assuming $M(n) \approx n^\alpha$ for $\alpha > 1$ and $k = \gamma n$, we
get


    $S(n) = \frac{\gamma^\alpha}{1 - 2(1 - \gamma)^\alpha}\, M(n)$,

where the optimal value is $\gamma = 1/2$ in the quadratic range, $\gamma \approx 0.694$ in
the Karatsuba range, and $\gamma \approx 0.775$ in the Toom–Cook 3-way range, giving
respectively $S(n) \sim 0.5M(n)$, $S(n) \sim 0.808M(n)$, and $S(n) \sim 0.888M(n)$.
The ratio $S(n)/M(n) \to 1$ as $r \to \infty$ for Toom–Cook $r$-way. In the FFT
range, Algorithm ShortProduct is not any faster than a full product.
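
The optimal values of $\gamma$ quoted above can be recovered numerically from the closed form for $S(n)/M(n)$. The sketch below uses a plain grid search over $1/2 \le \gamma < 1$; the function names are ours, and the exponents $\alpha = 2$, $\lg 3$ and $\log_3 5$ are the usual ones for schoolbook, Karatsuba and Toom–Cook 3-way multiplication.

```python
from math import log

def short_product_ratio(gamma, alpha):
    """S(n)/M(n) for k = gamma * n, assuming M(n) = n**alpha."""
    return gamma**alpha / (1 - 2 * (1 - gamma)**alpha)

def best_gamma(alpha, steps=5000):
    # Grid search over 1/2 <= gamma < 1 (the algorithm requires k >= n/2).
    grid = [0.5 + 0.5 * i / steps for i in range(steps)]
    return min(grid, key=lambda g: short_product_ratio(g, alpha))

ranges = [("quadratic", 2.0), ("Karatsuba", log(3, 2)), ("Toom-Cook 3-way", log(5, 3))]
for name, alpha in ranges:
    g = best_gamma(alpha)
    print(f"{name}: gamma ~ {g:.3f}, S(n)/M(n) ~ {short_product_ratio(g, alpha):.3f}")
```

Setting the derivative of the ratio to zero gives $\gamma = 1 - 2^{-1/(\alpha - 1)}$, which agrees with the grid search.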


3.3.1 Integer multiplication via complex FFT 


To multiply $n$-bit integers, it may be advantageous to use the fast Fourier
transform (FFT, see §1.3.4, §2.3). Note that three FFTs give the cyclic convolution
$z = x * y$ defined by


    $z_k = \sum_{0 \le j < N} x_j\, y_{(k - j) \bmod N}$  for $0 \le k < N$.

In order to use the FFT for integer multiplication, we have to pad the input
vectors with zeros, thus increasing the length of the transform from $N$ to $2N$.

FFT algorithms fall into two classes: those using number theoretical properties
(typically working over a finite ring, as in §2.3.3), and those based on complex
floating-point computations. The latter, while not having the best asymptotic
complexity, exhibit good practical behavior, because they take advantage
of the efficiency of floating-point hardware. The drawback of the complex
floating-point FFT (complex FFT for short) is that, being based on floating-point
computations, it requires a rigorous error analysis. However, in some
contexts where occasional errors are not disastrous, we may accept a small
probability of error if this speeds up the computation. For example, in the
context of integer factorization, a small probability of error is acceptable because
the result (a purported factorization) can easily be checked and discarded if
incorrect.
The following theorem provides a tight error analysis: 


Theorem 3.6 The complex FFT allows computation of the cyclic convolution
$z = x * y$ of two vectors of length $N = 2^n$ of complex values such that


    $\|z' - z\|_\infty \le \|x\| \cdot \|y\| \cdot \left((1 + \varepsilon)^{3n}(1 + \varepsilon\sqrt{5})^{3n+1}(1 + \mu)^{3n} - 1\right)$,    (3.2)


where $\|\cdot\|$ and $\|\cdot\|_\infty$ denote the Euclidean and infinity norms respectively,
$\varepsilon$ is such that $|(a \pm b)' - (a \pm b)| \le \varepsilon|a \pm b|$, $|(ab)' - (ab)| \le \varepsilon|ab|$ for all
machine floats $a$, $b$. Here $\mu \ge |(w^k)' - (w^k)|$, $0 \le k < N$, $w = e^{2\pi i/N}$, and
$(\cdot)'$ refers to the computed (stored) value of $(\cdot)$ for each expression.


For the IEEE 754 double-precision format, with rounding to nearest, we have
$\varepsilon = 2^{-53}$, and if the $w^k$ are correctly rounded, we can take $\mu = \varepsilon/\sqrt{2}$. For a
fixed FFT size $N = 2^n$, the inequality (3.2) enables us to compute a bound $B$
on the components of $x$ and $y$ that guarantees $\|z' - z\|_\infty < 1/2$. If we know
that the exact result $z \in \mathbb{Z}^N$, this enables us to uniquely round the components
of $z'$ to $z$. Table 3.3 gives $b = \lg B$, the number of bits that can be used
in a 64-bit floating-point word, if we wish to perform $m$-bit multiplication
exactly (here $m = 2^{n-1}b$). It is assumed that the FFT is performed with signed
components in $\mathbb{Z} \cap [-2^{b-1}, +2^{b-1})$, see for example [80, p. 1

Note that Theorem 3.6 is a worst-case result; with rounding to nearest we 
expect the error to be smaller due to cancellation — see Exercise 3.9. 

Since 64-bit floating-point numbers have bounded precision, we cannot
compute arbitrarily large convolutions by this method: the limit is about
$n = 43$. However, this corresponds to vectors of size $N = 2^n = 2^{43}$,
which is more than enough for practical purposes (see also Exercise 3.11).
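
The complex-FFT approach of this section can be sketched in pure Python as follows. This is an illustration under simplifying assumptions, not a tuned implementation: it uses a textbook recursive radix-2 FFT, unsigned 8-bit components rather than the signed components assumed for Table 3.3, and relies on the sizes being small enough that rounding each component to the nearest integer recovers the exact convolution. The function names are ours.

```python
import cmath

def fft(a, inverse=False):
    """Recursive radix-2 FFT of a list whose length is a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def fft_multiply(x, y, bits=8):
    """Multiply non-negative integers via a zero-padded cyclic convolution."""
    def digits(v):
        d = []
        while v:
            d.append(v & ((1 << bits) - 1))
            v >>= bits
        return d or [0]

    xd, yd = digits(x), digits(y)
    size = 1
    while size < len(xd) + len(yd):      # zero padding avoids wrap-around
        size *= 2
    fx = fft(xd + [0] * (size - len(xd)))
    fy = fft(yd + [0] * (size - len(yd)))
    z = fft([u * v for u, v in zip(fx, fy)], inverse=True)   # third FFT
    # The exact convolution is integral, so each component is rounded.
    coeffs = [round(c.real / size) for c in z]
    return sum(c << (bits * i) for i, c in enumerate(coeffs))

print(fft_multiply(12345, 6789) == 12345 * 6789)
```

A production implementation would choose the number of bits per component from a bound such as (3.2) instead of fixing it at 8.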


    n    b         m         n    b         m

    1   25        25        11   18     18432
    2   24        48        12   17     34816
    3   23        92        13   17     69632
    4   22       176        14   16    131072
    5   22       352        15   16    262144
    6   21       672        16   15    491520
    7   20      1280        17   15    983040
    8   20      2560        18   14   1835008
    9   19      4864        19   14   3670016
   10   19      9728        20   13   6815744


Table 3.3 Maximal number $b$ of bits per IEEE 754
double-precision floating-point number (binary64, 53-bit
significand), and maximal $m$ for a plain $m \times m$ bit integer
product, for a given FFT size $2^n$, with signed components.


3.3.2 The middle product


Given two integers of $2n$ and $n$ bits respectively, their “middle product”
consists of the middle $n$ bits of their $3n$-bit product (see Figure 3.3). The middle
product might be computed using two short products, one (low) short product
between $x$ and the high part of $y$, and one (high) short product between $x$ and
the low part of $y$. However there are algorithms to compute a $2n \times n$ middle
product with the same $\sim M(n)$ complexity as an $n \times n$ full product (see §3.8).
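
To illustrate the definition, the sketch below splits a $3n$-bit product into its high, middle and low $n$-bit fields, using an ordinary full product rather than a dedicated middle-product algorithm. The operands are chosen (this is our own example) so that $x$ is an $n$-bit approximation of $1/y$, as arises in Newton's iteration for the reciprocal: the high field is then predictable and the middle field carries the residual $1 - xy$ up to the neglected low field.

```python
n = 16
Y = 0xB504F333                       # 2n-bit operand, y = Y / 2**(2n), about 0.7071
X = (1 << (3 * n - 1)) // Y          # n-bit operand, x = X / 2**(n-1), about 1/y

P = X * Y                            # full product, just below 2**(3n-1)
high = P >> (2 * n)                  # n most significant bits
middle = (P >> n) & ((1 << n) - 1)   # middle n bits
low = P & ((1 << n) - 1)             # n least significant bits

# 1 - x*y scaled by 2**(3n-1): exact value, and value from the middle bits only.
exact_residual = (1 << (3 * n - 1)) - P
approx_residual = (1 << (2 * n)) - (middle << n)
print(bin(high), approx_residual - exact_residual == low)
```

Here the high field is a zero followed by $n - 1$ ones, and the two residuals differ exactly by the low field, which is below $2^n$.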


Figure 3.3 The middle product of $x$ of $n$ bits and $y$ of $2n$ bits
corresponds to the middle region (most significant bits bottom
left).


Several applications benefit from an efficient middle product. One of these
applications is Newton’s method (§4.2). Consider, for example, the reciprocal
iteration (§4.2.2): $x_{j+1} = x_j + x_j(1 - x_j y)$. If $x_j$ has $n$ bits, we have to
consider $2n$ bits from $y$ in order to get $2n$ accurate bits in $x_{j+1}$. The product
$x_j y$ has $3n$ bits, but if $x_j$ is accurate to $n$ bits, the $n$ most significant bits
of $x_j y$ cancel with 1, and the $n$ least significant bits can be ignored as they
only contribute noise. Thus, the middle product of $x_j$ and $y$ is exactly what is
needed.


Payne and Hanek argument reduction
Another application of the middle product is Payne and Hanek argument
reduction. Assume $x = m \cdot 2^e$ is a floating-point number with a significand
$0.5 \le m < 1$ of $n$ bits and a large exponent $e$ (say $n = 53$ and $e = 1024$ to fix
the ideas). We want to compute $\sin x$ with a precision of $n$ bits. The classical
argument reduction works as follows: first compute $k = \lfloor x/\pi \rceil$, then compute
the reduced argument


    $x' = x - k\pi$.    (3.3)


About $e$ bits will be cancelled in the subtraction $x - (k\pi)$, and thus we need to
compute $k\pi$ with a precision of at least $e + n$ bits to get an accuracy of at least
$n$ bits for $x'$. Of course, this assumes that $x$ is known exactly; otherwise there
is no point in trying to compute $\sin x$. Assuming $1/\pi$ has been precomputed to
precision $e$, the computation of $k$ costs $M(e, n)$, and the multiplication $k \times \pi$
costs $M(e, e + n)$; therefore, the total cost is about $M(e)$ when $e \gg n$.


Figure 3.4 A graphical view of Payne and Hanek algorithm.


The key idea of the Payne and Hanek algorithm is to rewrite Eqn. (3.3) as

    $x' = \pi\left(\frac{x}{\pi} - k\right)$.    (3.4)

If the significand of $x$ has $n < e$ bits, only about $2n$ bits from the expansion
of $1/\pi$ will effectively contribute to the $n$ most significant bits of $x'$, namely
the bits of weight $2^{-e-n}$ to $2^{-e+n}$. Let $y$ be the corresponding $2n$-bit part
of $1/\pi$. Payne and Hanek’s algorithm works as follows: first multiply the
$n$-bit significand of $x$ by $y$, keep the $n$ middle bits, and multiply by an $n$-bit
approximation of $\pi$. The total cost is $\sim (M(2n, n) + M(n))$, or even $\sim 2M(n)$
if the middle product is performed in time $M(n)$, and thus independent of $e$.


3.4 Reciprocal and division


As for integer operations (§1.4), we should try as far as possible to trade
floating-point divisions for multiplications, since the cost of a floating-point
multiplication is theoretically smaller than the cost of a division by a constant
factor (usually from 2 to 5, depending on the algorithm used). In practice, the
ratio might not even be constant, unless care is taken in implementing division.
Some implementations provide division with cost $O(M(n) \log n)$ or $O(n^2)$.

When several divisions have to be performed with the same divisor, a
well-known trick is to first compute the reciprocal of the divisor (§3.4.1); then each
division reduces to a multiplication by the reciprocal. A small drawback is that
each division incurs two rounding errors (one for the reciprocal and one for
multiplication by the reciprocal) instead of one, so we can no longer guarantee
a correctly rounded result. For example, in base ten with six digits, $3.0/3.0$
might evaluate to $0.999\,999 = 3.0 \times 0.333\,333$.
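
The base-ten example can be reproduced with Python's decimal module at six digits of precision (variable names are ours):

```python
from decimal import Decimal, Context, ROUND_HALF_EVEN

ctx = Context(prec=6, rounding=ROUND_HALF_EVEN)
three = Decimal("3.0")
reciprocal = ctx.divide(Decimal(1), three)         # first rounding: 0.333333
via_reciprocal = ctx.multiply(three, reciprocal)   # second rounding
direct = ctx.divide(three, three)                  # one correctly rounded division
print(reciprocal, via_reciprocal, direct)
```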

The cases of a single division, or several divisions with a varying divisor, 
are considered in §3.4.2. 


3.4.1 Reciprocal


Here we describe algorithms that compute an approximate reciprocal of a
positive floating-point number $a$, using integer-only operations (see Chapter 1).
The integer operations simulate floating-point computations, but all roundings
are made explicit. The number $a$ is represented by an integer $A$ of $n$ words in
radix $\beta$: $a = \beta^{-n}A$, and we assume $\beta^n/2 \le A$, thus requiring $1/2 \le a < 1$.
(This does not cover all cases for $\beta \ge 3$, but if $\beta^{n-1} \le A < \beta^n/2$, multiplying
$A$ by some appropriate integer $k < \beta$ will reduce to the case $\beta^n/2 \le A$; then
it suffices to multiply the reciprocal of $ka$ by $k$.)

We first perform an error analysis of Newton’s method (§4.2) assuming all 
computations are done with infinite precision, and thus neglecting roundoff 
errors. 


Lemma 3.7 Let $1/2 \le a < 1$, $\rho = 1/a$, $x > 0$, and $x' = x + x(1 - ax)$. Then


    $0 \le \rho - x' \le \frac{x^2}{\theta^3}(\rho - x)^2$,

for some $\theta \in [\min(x, \rho), \max(x, \rho)]$.


Proof. Newton's iteration is based on approximating the function by its
tangent. Let $f(t) = a - 1/t$, with $\rho$ the root of $f$. The second-order
expansion of $f$ at $t = \rho$ with explicit remainder is

$$f(\rho) = f(x) + (\rho - x) f'(x) + \frac{(\rho - x)^2}{2} f''(\theta), \qquad (3.5)$$

for some $\theta \in [\min(x, \rho), \max(x, \rho)]$. Since $f(\rho) = 0$, this
simplifies to

$$\rho = x - \frac{f(x)}{f'(x)} - \frac{(\rho - x)^2}{2} \frac{f''(\theta)}{f'(x)}.$$

Substituting $f(t) = a - 1/t$, $f'(t) = 1/t^2$ and $f''(t) = -2/t^3$, it
follows that

$$\rho = x + x(1 - ax) + \frac{x^2}{\theta^3} (\rho - x)^2,$$

which proves the claim.
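
As an illustration of Lemma 3.7, the following sketch runs the iteration $x' = x + x(1 - ax)$ in exact rational arithmetic. It relies on the identity $\rho - x' = a(\rho - x)^2$, which follows by expanding $1/a - 2x + ax^2$; this identity is not stated in the text but is consistent with the lemma (take $\theta^3 = x^2/a$). The function name and the starting values are illustrative assumptions.

```python
from fractions import Fraction

def newton_reciprocal_step(a, x):
    """One Newton step for the reciprocal of a: x' = x + x(1 - a x)."""
    return x + x * (1 - a * x)

a = Fraction(3, 4)        # 1/2 <= a < 1
rho = 1 / a               # exact reciprocal
x = Fraction(1)           # starting approximation
errors = []
for _ in range(4):
    x_new = newton_reciprocal_step(a, x)
    # With exact arithmetic the new error is a * (old error)^2, hence >= 0.
    assert rho - x_new == a * (rho - x) ** 2
    errors.append(rho - x_new)
    x = x_new

print([float(err) for err in errors])
```

The errors are non-negative and roughly square at each step, which is the quadratic convergence that the lemma quantifies.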


Algorithm ApproximateReciprocal computes an approximate reciprocal.
The input $A$ is assumed to be normalized, i.e. $\beta^n/2 \le A < \beta^n$.
The output integer $X$ is an approximation to $\beta^{2n}/A$.


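Algorithm 3.5 ApproximateReciprocal
Input: $A = \sum_{i=0}^{n-1} a_i \beta^i$, with $0 \le a_i < \beta$ and $\beta/2 \le a_{n-1}$
Output: $X = \beta^n + \sum_{i=0}^{n-1} x_i \beta^i$ with $0 \le x_i < \beta$
 1: if $n \le 2$ then return $\lceil \beta^{2n}/A \rceil - 1$
 2: $\ell \leftarrow \lfloor (n-1)/2 \rfloor$, $h \leftarrow n - \ell$
 3: $A_h \leftarrow \sum_{i=0}^{h-1} a_{\ell+i} \beta^i$
 4: $X_h \leftarrow$ ApproximateReciprocal($A_h$)
 5: $T \leftarrow A X_h$
 6: while $T \ge \beta^{n+h}$ do
 7:     $(X_h, T) \leftarrow (X_h - 1, T - A)$
 8: $T \leftarrow \beta^{n+h} - T$
 9: $T_m \leftarrow \lfloor T \beta^{-\ell} \rfloor$
10: $U \leftarrow T_m X_h$
11: return $X_h \beta^{\ell} + \lfloor U \beta^{\ell-2h} \rfloor$.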
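
The steps of Algorithm ApproximateReciprocal can be sketched with Python's built-in arbitrary-precision integers. This is a direct transcription for experimentation, not an optimized implementation: the function name, the explicit argument `n` and the default radix `beta = 2**8` are assumptions made for the example, and the products use Python's own multiplication rather than the algorithms of Chapter 1.

```python
def approximate_reciprocal(A, n, beta=2**8):
    """Approximate beta**(2n) / A for beta**n / 2 <= A < beta**n."""
    if n <= 2:
        # Step 1: ceil(beta^(2n) / A) - 1, using integer arithmetic only.
        return -((-beta ** (2 * n)) // A) - 1
    l = (n - 1) // 2                      # step 2
    h = n - l
    A_h = A // beta ** l                  # step 3: upper h words of A
    X_h = approximate_reciprocal(A_h, h, beta)   # step 4
    T = A * X_h                           # step 5
    while T >= beta ** (n + h):           # steps 6-7: ensure A * X_h < beta^(n+h)
        X_h, T = X_h - 1, T - A
    T = beta ** (n + h) - T               # step 8
    T_m = T // beta ** l                  # step 9: drop the low l words
    U = T_m * X_h                         # step 10
    return X_h * beta ** l + U // beta ** (2 * h - l)   # step 11

A = 0xC0FFEE                      # a normalized 3-word input in radix 2**8
X = approximate_reciprocal(A, 3)
print(X, A * X < 2 ** 48 < A * (X + 2))
```

The printed Boolean checks the inequality $AX < \beta^{2n} < A(X + 2)$ for this one input.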


Lemma 3.8 If $\beta$ is a power of two satisfying $\beta \ge 8$, and
$\beta^n/2 \le A < \beta^n$, then the output $X$ of Algorithm
ApproximateReciprocal satisfies

$$A X < \beta^{2n} < A(X + 2).$$


Proof. For $n \le 2$, the algorithm returns $X = \lfloor \beta^{2n}/A \rfloor$,
unless $A = \beta^n/2$, when it returns $X = 2\beta^n - 1$. In both cases, we
have $AX < \beta^{2n} \le A(X + 1)$; thus, the lemma holds for $n \le 2$.

Now consider $n \ge 3$. We have $\ell = \lfloor (n-1)/2 \rfloor$ and
$h = n - \ell$, and therefore $n = h + \ell$ and $h > \ell$. The algorithm
first computes an approximate reciprocal of the upper $h$ words of $A$, and
then updates it to $n$ words using Newton's iteration.


After the recursive call at line 4, we have by induction

$$A_h X_h < \beta^{2h} < A_h (X_h + 2). \qquad (3.6)$$

After the product $T \leftarrow A X_h$ and the while-loop at steps 6-7, we
still have $T = A X_h$, where $T$ and $X_h$ may have new values, and in
addition $T < \beta^{n+h}$. We also have $\beta^{n+h} < T + 2A$; we prove this
by distinguishing two cases. Either we entered the while-loop, then since the
value of $T$ decreased by $A$ at each loop, the previous value $T + A$ was
necessarily $\ge \beta^{n+h}$. If we did not enter the while-loop, the value
of $T$ is still $A X_h$. Multiplying Eqn. (3.6) by $\beta^{\ell}$ gives:
$\beta^{n+h} < A_h \beta^{\ell} (X_h + 2) \le A(X_h + 2) = T + 2A$. Thus, we
have

$$0 \le T < \beta^{n+h} < T + 2A.$$

It follows that $T > \beta^{n+h} - 2A > \beta^{n+h} - 2\beta^n$. As a
consequence, the value of $\beta^{n+h} - T$ computed at step 8 can not exceed
$2\beta^n - 1$. The last lines compute the product $T_m X_h$, where $T_m$ is
the upper part of $T$, and put its $\ell$ most significant words in the low
part $X_{\ell}$ of the result $X$.

Now let us perform the error analysis. Compared to Lemma 3.7, $x$ stands
for $X_h \beta^{-h}$, $a$ stands for $A \beta^{-n}$, and $x'$ stands for
$X \beta^{-n}$. The while-loop ensures that we start from an approximation
$x < 1/a$, i.e. $A X_h < \beta^{n+h}$. Then Lemma 3.7 guarantees that
$x \le x' \le 1/a$ if $x'$ is computed with infinite precision. Here we have
$x \le x'$, since $X = X_h \beta^{\ell} + X_{\ell}$, where $X_{\ell} \ge 0$.
The only differences compared to infinite precision are:

• the low $\ell$ words from $1 - ax$ (here $T$ at line 8) are neglected, and
  only its upper part $(1 - ax)_h$ (here $T_m$) is considered;
• the low $2h - \ell$ words from $x(1 - ax)_h$ are neglected.

Those two approximations make the computed value of $x'$ $\le$ the value which
would be computed with infinite precision. Thus, for the computed value $x'$,
we have

$$x \le x' \le 1/a.$$


From Lemma 3.7, the mathematical error is bounded by
$x^2 \theta^{-3} (\rho - x)^2 < 4\beta^{-2h}$, since $x^2 \le \theta^3$ and
$|\rho - x| < 2\beta^{-h}$. The truncation from $1 - ax$, which is multiplied
by $x < 2$, produces an error $< 2\beta^{-2h}$. Finally, the truncation of
$x(1 - ax)_h$ produces an error $< \beta^{-n}$. The final result is thus

$$x' \le \rho < x' + 6\beta^{-2h} + \beta^{-n}.$$

Assuming $6\beta^{-2h} \le \beta^{-n}$, which holds as soon as $\beta \ge 6$
since $2h > n$, this simplifies to

$$x' \le \rho < x' + 2\beta^{-n},$$

which gives with $x' = X \beta^{-n}$ and $\rho = \beta^n/A$

$$X \le \frac{\beta^{2n}}{A} < X + 2.$$

Since $\beta$ is assumed to be a power of two, equality can hold only when $A$
is itself a power of two, i.e. $A = \beta^n/2$. In this case, there is only
one value of $X_h$ that is possible for the recursive call, namely
$X_h = 2\beta^h - 1$. In this case, $T = \beta^{n+h} - \beta^n/2$ before the
while-loop, which is not entered. Then $\beta^{n+h} - T = \beta^n/2$, which
multiplied by $X_h$ gives (again) $\beta^{n+h} - \beta^n/2$, whose $h$ most
significant words are $\beta - 1$. Thus, $X_{\ell} = \beta^{\ell} - 1$, and
$X = 2\beta^n - 1$.


REMARK. Lemma 3.8 might be extended to the case $\beta^{n-1} \le A < \beta^n$,
or to a radix $\beta$ which is not a power of two. However, we prefer to state
a restricted result with simple bounds.


COMPLEXITY ANALYSIS. Let $I(n)$ be the cost to invert an $n$-word number
using Algorithm ApproximateReciprocal. If we neglect the linear costs,
we have $I(n) \approx I(n/2) + M(n, n/2) + M(n/2)$, where $M(n, n/2)$ is the
cost of an $n \times (n/2)$ product (the product $A X_h$ at step 5) and
$M(n/2)$ the cost of an $(n/2) \times (n/2)$ product (the product $T_m X_h$
at step 10). If the $n \times (n/2)$ product is performed via two
$(n/2) \times (n/2)$ products, we have $I(n) \approx I(n/2) + 3M(n/2)$, which
yields $I(n) \sim M(n)$ in the quadratic range, $\sim 1.5M(n)$ in the
Karatsuba range, $\sim 1.704M(n)$ in the Toom–Cook 3-way range, and
$\sim 3M(n)$ in the FFT range. In the FFT range, an $n \times (n/2)$ product
might be directly computed by three FFTs of length $3n/2$ words, amounting
to $\sim M(3n/4)$; in this case, the complexity decreases to $\sim 2.5M(n)$
(see the comments at the end of §2.3.3, page 58).
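
The constants quoted above can be checked numerically under the simplifying assumption $M(n) = n^{\alpha}$ (with $\alpha = 2$, $\log_2 3$, $\log_3 5$ and $1$ for the quadratic, Karatsuba, Toom–Cook 3-way and FFT ranges respectively). Substituting $I(n) = c\,M(n)$ into $I(n) = I(n/2) + 3M(n/2)$ gives $c = 3/(2^{\alpha} - 1)$. The function name below is illustrative.

```python
import math

def inversion_cost_ratio(alpha):
    """I(n)/M(n) for I(n) = I(n/2) + 3 M(n/2), assuming M(n) = n**alpha."""
    return 3 / (2 ** alpha - 1)

ranges = {
    "quadratic": 2.0,
    "Karatsuba": math.log2(3),
    "Toom-Cook 3-way": math.log(5) / math.log(3),
    "FFT (M(n) taken as linear)": 1.0,
}
for name, alpha in ranges.items():
    print(f"{name}: I(n) ~ {inversion_cost_ratio(alpha):.3f} M(n)")
```

The same geometric-series argument with $M(3n/4) + M(n/2) = 1.25M(n)$ per level in the FFT range gives the factor 2.5.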


THE WRAP-AROUND TRICK. We now describe a slight modification of
Algorithm ApproximateReciprocal, which yields a complexity $2M(n)$. In
the product $A X_h$ at step 5, Eqn. (3.6) tells us that the result approaches
$\beta^{n+h}$, or more precisely

$$\beta^{n+h} - 2\beta^n < A X_h < \beta^{n+h} + 2\beta^n. \qquad (3.7)$$

Assume we use an FFT-based algorithm such as the Schönhage–Strassen
algorithm that computes products modulo $\beta^m + 1$, for some integer
$m \in (n, n+h)$. Let $A X_h = U \beta^m + V$ with $0 \le V < \beta^m$. It
follows from Eqn. (3.7) that $U = \beta^{n+h-m}$ or $U = \beta^{n+h-m} - 1$.
Let $T = A X_h \bmod (\beta^m + 1)$ be the value computed by the algorithm.
Now $T = V - U$ or $T = V - U + (\beta^m + 1)$. It follows that
$A X_h = T + U(\beta^m + 1)$ or $A X_h = T + (U - 1)(\beta^m + 1)$.




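Taking into account the two possible values of $U$, we have

$$A X_h = T + (\beta^{n+h-m} - \varepsilon)(\beta^m + 1),$$

where $\varepsilon \in \{0, 1, 2\}$. Since $\beta \ge 6$, $\beta^m > 4\beta^n$,
thus only one value of $\varepsilon$ yields a value of $A X_h$ in the interval
$(\beta^{n+h} - 2\beta^n, \beta^{n+h} + 2\beta^n)$.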

Thus, we can replace step 5 in Algorithm ApproximateReciprocal by the
following code:

    Compute $T = A X_h \bmod (\beta^m + 1)$ using FFTs with length $m > n$
    $T \leftarrow T + \beta^{n+h} + \beta^{n+h-m}$        ▷ the case $\varepsilon = 0$
    while $T \ge \beta^{n+h} + 2\beta^n$ do
        $T \leftarrow T - (\beta^m + 1)$

Assuming that we can take $m$ close to $n$, the cost of the product $A X_h$ is
only about that of three FFTs of length $n$, i.e. $\sim M(n/2)$.
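
The recovery of $A X_h$ from its residue can be sketched as follows. The residue is computed here with Python's `%` operator rather than with an FFT, and $X_h$ is taken to be $\lfloor (\beta^{2h} - 1)/A_h \rfloor$, a stand-in that satisfies Eqn. (3.6); both are assumptions made to keep the example self-contained, as are the function and variable names.

```python
def product_from_residue(T, n, h, m, beta):
    """Recover A * X_h from T = A * X_h mod (beta**m + 1), using Eqn. (3.7)."""
    T += beta ** (n + h) + beta ** (n + h - m)    # the case epsilon = 0
    while T >= beta ** (n + h) + 2 * beta ** n:   # at most two corrections
        T -= beta ** m + 1
    return T

beta, n = 8, 4
l = (n - 1) // 2
h = n - l
m = n + 1                                  # any integer with n < m < n + h
A = 3000                                   # beta**n / 2 <= A < beta**n
A_h = A // beta ** l                       # upper h words of A
X_h = (beta ** (2 * h) - 1) // A_h         # stand-in satisfying Eqn. (3.6)
residue = (A * X_h) % (beta ** m + 1)      # what a wrap-around FFT product yields
recovered = product_from_residue(residue, n, h, m, beta)
print(recovered == A * X_h)
```

Only the residue modulo $\beta^m + 1$ is needed because the high part of the product is already known from Eqn. (3.7).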


3.4.2 Division 


In this section, we consider the case where the divisor changes between
successive operations, so no precomputation involving the divisor can be
performed. We first show that the number of consecutive zeros in the result is
bounded by the divisor length, then we consider the division algorithm and its
complexity. Lemma 3.10 analyses the case where the division operands are
truncated, because they have a larger precision than desired in the result.
Finally, we discuss "short division" and the error analysis of Barrett's
algorithm.

A floating-point division reduces to an integer division as follows. Assume
dividend $a = \ell \cdot \beta^e$ and divisor $d = m \cdot \beta^f$, where
$\ell, m$ are integers. Then $a/d = (\ell/m) \beta^{e-f}$. If $k$ bits of the
quotient are needed, we first determine a scaling factor $g$ such that
$\beta^{k-1} \le |\ell \beta^g / m| < \beta^k$, and we divide $\ell \beta^g$
(truncated if needed) by $m$. The following theorem gives a bound on the
number of consecutive zeros after the integer part of the quotient of
$\lfloor \ell \beta^g \rfloor$ by $m$.


Theorem 3.9 Assume we divide an $m$-digit positive integer by an $n$-digit
positive integer in radix $\beta$, with $m \ge n$. Then the quotient is either
exact, or its radix $\beta$ expansion admits at most $n - 1$ consecutive zeros
or ones after the digit of weight $\beta^0$.


Proof. We first consider consecutive zeros. If the expansion of the quotient
$q$ admits $n$ or more consecutive zeros after the binary point, we can write
$q = q_1 + \beta^{-n} q_0$, where $q_1$ is an integer and $0 \le q_0 < 1$. If
$q_0 = 0$, then the quotient is exact. Otherwise, if $a$ is the dividend and
$d$ is the divisor, we should have $a = q_1 d + \beta^{-n} q_0 d$. However,
$a$ and $q_1 d$ are integers, and $0 < \beta^{-n} q_0 d < 1$, so
$\beta^{-n} q_0 d$ can not be an integer, and we have a contradiction.

For consecutive ones, the proof is similar: write $q = q_1 - \beta^{-n} q_0$,
with $0 \le q_0 \le 1$. Since $d < \beta^n$, we still have
$0 \le \beta^{-n} q_0 d < 1$.


Algorithm DivideNewton performs the division of two $n$-digit floating-point
numbers. The key idea is to approximate the inverse of the divisor to half
precision only, at the expense of additional steps. At step 4,
MiddleProduct($q_0$, $d$) denotes the middle product of $q_0$ and $d$, i.e.
the $n/2$ middle digits of that product. At step 2, $r$ is an approximation to
$1/d_1$, and thus to $1/d$, with precision $n/2$ digits. Therefore, at step 3,
$q_0$ approximates $c/d$ to about $n/2$ digits, and the upper $n/2$ digits of
$q_0 d$ at step 4 agree with those of $c$. The value $e$ computed at step 4
thus equals $q_0 d - c$ to precision $n/2$. It follows that $re \approx e/d$
agrees with $q_0 - c/d$ to precision $n/2$; hence, the correction term
(which is really a Newton correction) added in the last step.


Algorithm 3.6 DivideNewton
Input: $n$-digit floating-point numbers $c$ and $d$, with $n$ even, $d$ normalized
Output: an approximation of $c/d$
1: write $d = d_1 \beta^{n/2} + d_0$ with $0 \le d_1, d_0 < \beta^{n/2}$
2: $r \leftarrow$ ApproximateReciprocal($d_1$, $n/2$)
3: $q_0 \leftarrow cr$ truncated to $n/2$ digits
4: $e \leftarrow$ MiddleProduct($q_0$, $d$)
5: $q \leftarrow q_0 - re$.

In the FFT range, the cost of Algorithm DivideNewton is $\sim 2.5M(n)$: step 2
costs $\sim 2M(n/2) \sim M(n)$ with the wrap-around trick, and steps 3-5 each
cost $\sim M(n/2)$, using a fast middle product algorithm for step 4. By way
of comparison, if we computed a full precision inverse as in Barrett's
algorithm (see below), the cost would be $\sim 3.5M(n)$. (See §3.8 for
improved asymptotic bounds on division.)

In the Karatsuba range, Algorithm DivideNewton costs $\sim 1.5M(n)$, and is
useful provided the middle product of step 4 is performed with cost
$\sim M(n/2)$. In the quadratic range, Algorithm DivideNewton costs
$\sim 2M(n)$, and a classical division should be preferred.

When the requested precision for the output is smaller than that of the inputs
of a division, we have to truncate the inputs in order to avoid an
unnecessarily expensive computation. Assume for example that we want to divide
two numbers of 10 000 bits, with a 10-bit quotient. To apply the following
lemma, just replace $\mu$ by an appropriate value such that $A_1$ and $B_1$
have about $2n$ and $n$ digits respectively, where $n$ is the desired number
of digits in the quotient; for example, we might choose $\mu = \beta^k$ to
truncate to $k$ words.

Lemma 3.10 Let $A, B, \mu \in \mathbb{N}^*$, $2 \le \mu \le B$. Let
$Q = \lfloor A/B \rfloor$, $A_1 = \lfloor A/\mu \rfloor$,
$B_1 = \lfloor B/\mu \rfloor$, $Q_1 = \lfloor A_1/B_1 \rfloor$. If
$A/B \le 2B_1$, then

$$Q \le Q_1 \le Q + 2.$$

The condition $A/B \le 2B_1$ is quite natural: it says that the truncated
divisor $B_1$ should have essentially at least as many digits as the desired
quotient.


Proof. Let $A_1 = Q_1 B_1 + R_1$. We have $A = A_1 \mu + A_0$,
$B = B_1 \mu + B_0$, therefore

$$\frac{A}{B} = \frac{A_1 \mu + A_0}{B_1 \mu + B_0} \le \frac{A_1 \mu + A_0}{B_1 \mu} = Q_1 + \frac{R_1 \mu + A_0}{B_1 \mu}.$$

Since $R_1 < B_1$ and $A_0 < \mu$, $R_1 \mu + A_0 < B_1 \mu$, thus
$A/B < Q_1 + 1$. Taking the floor of each side proves, since $Q_1$ is an
integer, that $Q \le Q_1$.

Now consider the second inequality. For given truncated parts $A_1$ and $B_1$,
and thus given $Q_1$, the worst case is when $A$ is minimal, say
$A = A_1 \mu$, and $B$ is maximal, say $B = B_1 \mu + (\mu - 1)$. In this
case, we have

$$\left| \frac{A_1}{B_1} - \frac{A}{B} \right| = \left| \frac{A_1}{B_1} - \frac{A_1 \mu}{B_1 \mu + (\mu - 1)} \right| = \left| \frac{A_1 (\mu - 1)}{B_1 (B_1 \mu + \mu - 1)} \right|.$$

The numerator equals $A - A_1 \le A$, and the denominator equals $B_1 B$;
therefore, the difference $A_1/B_1 - A/B$ is bounded by $A/(B_1 B) \le 2$, and
so is the difference between $Q$ and $Q_1$.
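
Lemma 3.10 is easy to try out with Python integers. The sketch below truncates both operands by the same $\mu$ and compares the cheap quotient $Q_1$ with the exact quotient $Q$; the function name and the sample operands are assumptions chosen for illustration.

```python
def truncated_quotient(A, B, mu):
    """Q1 = floor(A1 / B1), where A1 = floor(A / mu) and B1 = floor(B / mu)."""
    A1, B1 = A // mu, B // mu
    return A1 // B1

# Operands of about 70 and 60 bits; only a quotient of about 10 bits is wanted.
A = 2 ** 70 + 123456789
B = 2 ** 60 + 987654321
mu = 2 ** 48                      # keep about 12 bits of B
Q = A // B                        # exact quotient, for comparison only
Q1 = truncated_quotient(A, B, mu)
assert A <= 2 * (B // mu) * B     # hypothesis A/B <= 2*B1 of Lemma 3.10
print(Q, Q1, Q <= Q1 <= Q + 2)
```

The division actually performed involves operands of about 22 and 12 bits, yet the result is within 2 of the exact quotient.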


Algorithm ShortDivision is useful in the Karatsuba and Toom–Cook ranges.
The key idea is that, when dividing a $2n$-digit number by an $n$-digit
number, some work that is necessary for a full $2n$-digit division can be
avoided (see Figure 3.5).


Algorithm 3.7 ShortDivision
Input: $0 \le A < \beta^{2n}$, $\beta^n/2 \le B < \beta^n$
Output: an approximation of $A/B$
Require: a threshold $n_0$
1: if $n \le n_0$ then return $\lfloor A/B \rfloor$
2: choose $k \ge n/2$, $\ell \leftarrow n - k$
3: $(A_1, A_0) \leftarrow (A \text{ div } \beta^{2\ell}, A \bmod \beta^{2\ell})$
4: $(B_1, B_0) \leftarrow (B \text{ div } \beta^{\ell}, B \bmod \beta^{\ell})$
5: $(Q_1, R_1) \leftarrow$ DivRem($A_1$, $B_1$)
6: $A' \leftarrow R_1 \beta^{2\ell} + A_0 - Q_1 B_0 \beta^{\ell}$
7: $Q_0 \leftarrow$ ShortDivision($A' \text{ div } \beta^k$, $B \text{ div } \beta^k$)
8: return $Q_1 \beta^{\ell} + Q_0$.


Theorem 3.11 The approximate quotient $Q'$ returned by ShortDivision
differs at most by $2 \lg n$ from the exact quotient $Q = \lfloor A/B \rfloor$,
more precisely

$$Q \le Q' \le Q + 2 \lg n.$$


Proof. If $n \le n_0$, $Q = Q'$ so the statement holds. Assume $n > n_0$. We
have $A = A_1 \beta^{2\ell} + A_0$ and $B = B_1 \beta^{\ell} + B_0$; thus,
since $A_1 = Q_1 B_1 + R_1$,
$A = (Q_1 B_1 + R_1) \beta^{2\ell} + A_0 = Q_1 B \beta^{\ell} + A'$, with
$A' < \beta^{n+\ell}$. Let $A' = A'_1 \beta^k + A'_0$, and
$B = B'_1 \beta^k + B'_0$, with $0 \le A'_0, B'_0 < \beta^k$, and
$A'_1 < \beta^{2\ell}$. From Lemma 3.10, the exact quotient of
$A' \text{ div } \beta^k$ by $B \text{ div } \beta^k$ is greater or equal to
that of $A'$ by $B$; thus, by induction $Q_0 \ge A'/B$. Since
$A/B = Q_1 \beta^{\ell} + A'/B$, this proves that $Q' \ge Q$.

Now by induction, $Q_0 \le A'_1/B'_1 + 2 \lg \ell$, and
$A'_1/B'_1 \le A'/B + 2$ (from Lemma 3.10 again, whose hypothesis
$A'/B \le 2B'_1$ is satisfied, since $A' < B_1 \beta^{2\ell}$, thus
$A'/B \le \beta^{\ell} \le 2B'_1$), so $Q_0 \le A'/B + 2 \lg n$, and
$Q' \le A/B + 2 \lg n$.


As shown at the lower half of Figure 3.5, we can use a short product to
compute $Q_1 B_0$ at step 6. Indeed, we need only the upper $\ell$ words of
$A'$, and thus only the upper $\ell$ words of $Q_1 B_0$. The complexity of
Algorithm ShortDivision satisfies
$D^*(n) = D(k) + M^*(n-k) + D^*(n-k)$ with $k \ge n/2$, where $D(n)$ denotes
the cost of a division with remainder, and $M^*(n)$ the cost of a short
product. In the Karatsuba range, we have $D(n) \sim 2M(n)$,
$M^*(n) \sim 0.808M(n)$, and the best possible value of $k$ is
$k \approx 0.542n$, with corresponding cost $D^*(n) \sim 1.397M(n)$. In the
Toom–Cook 3-way range, $k \approx 0.548n$ is optimal, and gives
$D^*(n) \sim 1.988M(n)$.


Barrett's floating-point division algorithm

Here we consider floating-point division using Barrett's algorithm and provide
a rigorous error bound (see §2.4.1 for an exact integer version). The
algorithm is useful when the same divisor is used several times; otherwise
Algorithm DivideNewton is faster (see Exercise 3.13). Assume we want to divide
$a$ by $b$ of $n$ bits, each with a quotient of $n$ bits. Barrett's algorithm
is as follows:

1. Compute the reciprocal $r$ of $b$ to $n$ bits [rounding to nearest]
2. $q \leftarrow \circ_n(a \times r)$ [rounding to nearest]

The cost of the algorithm in the FFT range is $\sim 3M(n)$: $\sim 2M(n)$ to
compute the reciprocal with the wrap-around trick, and $M(n)$ for the product
$a \times r$.


Lemma 3.12 At step 2 of Barrett's algorithm, we have $|a - bq| \le 3|b|/2$.


Figure 3.5 Divide and conquer short division: a graphical view. Upper: with
plain multiplication; lower: with short multiplication. See also Figure 1.3.

Proof. By scaling $a$ and $b$, we can assume that $b$ and $q$ are integers,
that $2^{n-1} \le b, q < 2^n$; thus, $a < 2^{2n}$. We have
$r = 1/b + \varepsilon$ with
$|\varepsilon| \le \mathrm{ulp}(2^{-n})/2 = 2^{-2n}$. Also
$q = ar + \varepsilon'$ with $|\varepsilon'| \le \mathrm{ulp}(q)/2 = 1/2$
since $q$ has $n$ bits. Therefore,
$q = a(1/b + \varepsilon) + \varepsilon' = a/b + a\varepsilon + \varepsilon'$,
and $|bq - a| = |b| \, |a\varepsilon + \varepsilon'| \le 3|b|/2$.

As a consequence of Lemma 3.12, $q$ differs by at most one unit in last place
from the $n$-bit quotient of $a$ and $b$, rounded to nearest.
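
The bound of Lemma 3.12 can be checked experimentally by simulating $n$-bit rounding to nearest with exact rationals. The helpers `round_nearest` and `barrett_divide` are hypothetical names introduced for this sketch. Since the lemma is stated after scaling so that one unit in the last place of $q$ equals 1, the unscaled check below multiplies the bound by that unit, taken here as the spacing of $n$-bit numbers in the binade of the unrounded product $ar$.

```python
from fractions import Fraction

def round_nearest(x, n):
    """Round a positive Fraction to n significant bits (nearest, ties to even).

    Returns (rounded value, spacing of n-bit numbers in the binade of x)."""
    e = x.numerator.bit_length() - x.denominator.bit_length() - n
    while x >= Fraction(2) ** (e + n):       # scaled value would need > n bits
        e += 1
    while x < Fraction(2) ** (e + n - 1):    # scaled value would have < n bits
        e -= 1
    unit = Fraction(2) ** e
    return round(x / unit) * unit, unit

def barrett_divide(a, b, n):
    """Two-step division: n-bit reciprocal, then n-bit product."""
    r, _ = round_nearest(1 / Fraction(b), n)   # step 1
    return round_nearest(Fraction(a) * r, n)   # step 2: (q, unit)

n = 8
a, b = 201, 173                                # two 8-bit integers
q, unit = barrett_divide(a, b, n)
print(q, abs(a - b * q) <= Fraction(3, 2) * b * unit)
```

Exact rationals make the two rounding steps explicit, so any violation of the bound would show up as a failed comparison rather than being hidden by hardware arithmetic.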

Lemma 3.12 can be applied as follows: to perform several divisions with a
precision of $n$ bits with the same divisor, precompute a reciprocal with
$n + g$ bits, and use the above algorithm with a working precision of $n + g$
bits. If the last $g$ bits of $q$ are neither 000...00x nor 111...11x (where
x stands for 0 or 1), then rounding $q$ down to $n$ bits will yield
$\circ_n(a/b)$ for a directed rounding mode.


Which algorithm to use? 

In this section, we described three algorithms to compute $x/y$: DivideNewton
uses Newton's method for $1/y$ and incorporates the dividend $x$ at the last
iteration, ShortDivision is a recursive algorithm using division with remainder
and short products, and Barrett's algorithm assumes we have precomputed an
approximation to $1/y$. When the same divisor $y$ is used several times,
Barrett's algorithm is better, since each division costs only a short product.
Otherwise ShortDivision is theoretically faster than DivideNewton in the
schoolbook and Karatsuba ranges, and taking $k = n/2$ as parameter in
ShortDivision is close to optimal. In the FFT range, DivideNewton should be
preferred.


3.5 Square root 


Algorithm FPSqrt computes a floating-point square root, using as subroutine
Algorithm SqrtRem (§1.5.1) to determine an integer square root (with
remainder). It assumes an integer significand $m$, and a directed rounding
mode (see Exercise 3.14 for rounding to nearest).


Algorithm 3.8 FPSqrt
Input: $x = m \cdot 2^e$, a target precision $n$, a directed rounding mode $\circ$
Output: $y = \circ_n(\sqrt{x})$
if $e$ is odd then $(m', f) \leftarrow (2m, e - 1)$ else $(m', f) \leftarrow (m, e)$
define $m' := m_1 2^{2k} + m_0$, $m_1$ integer of $2n$ or $2n - 1$ bits, $0 \le m_0 < 2^{2k}$
$(s, r) \leftarrow$ SqrtRem($m_1$)
if ($\circ$ is round towards zero or down) or ($r = m_0 = 0$)
then return $s \cdot 2^{k+f/2}$ else return $(s + 1) \cdot 2^{k+f/2}$.


Theorem 3.13 Algorithm FPSqrt returns the correctly rounded square root
of $x$.


Proof. Since $m_1$ has $2n$ or $2n - 1$ bits, $s$ has exactly $n$ bits, and we
have $x \ge s^2 2^{2k+f}$; thus, $\sqrt{x} \ge s 2^{k+f/2}$. On the other
hand, SqrtRem ensures that $r \le 2s$, and
$x 2^{-f} = (s^2 + r) 2^{2k} + m_0 < (s^2 + r + 1) 2^{2k} \le (s + 1)^2 2^{2k}$.
Since $y := s \cdot 2^{k+f/2}$ and $y^+ = (s + 1) \cdot 2^{k+f/2}$ are two
consecutive $n$-bit floating-point numbers, this concludes the proof.
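
A minimal Python sketch of Algorithm FPSqrt follows, with `math.isqrt` playing the role of SqrtRem. The function name, the Boolean `round_up` flag (standing in for the choice of directed rounding mode) and the way a negative $k$ is handled by shifting left are assumptions made for this example.

```python
from math import isqrt

def fp_sqrt(m, e, n, round_up=False):
    """Return (s, exp) such that s * 2**exp is sqrt(m * 2**e) rounded to n bits.

    Rounds towards zero by default, away from zero if round_up is True."""
    if e % 2:                          # make the exponent even
        m, e = 2 * m, e - 1
    # Choose k so that m1 = floor(m / 2**(2k)) has 2n or 2n - 1 bits.
    k = (m.bit_length() - 2 * n + 1) // 2
    if k >= 0:
        m1, m0 = m >> (2 * k), m & ((1 << (2 * k)) - 1)
    else:                              # too few bits: pad with zeros on the right
        m1, m0 = m << (-2 * k), 0
    s = isqrt(m1)                      # integer square root ...
    r = m1 - s * s                     # ... and its remainder
    if not round_up or (r == 0 and m0 == 0):
        return s, k + e // 2
    return s + 1, k + e // 2

s, exp = fp_sqrt(2, 0, 10)             # sqrt(2) to 10 bits, rounded down
print(s, exp, s * 2.0 ** exp)
```

The returned significand always has exactly $n$ bits when rounding down; when rounding up it may equal $2^n$, which is the case discussed in the note that follows the proof.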


NOTE: in the case $s = 2^n - 1$, $s + 1 = 2^n$ is still representable in $n$
bits.


A different method is to use an initial approximation to the reciprocal square
root $x^{-1/2}$ (§3.5.1), see Exercise 3.15. Faster algorithms are mentioned
in §3.8.


3.5.1 Reciprocal square root 


In this section, we describe an algorithm to compute the reciprocal square root
$a^{-1/2}$ of a floating-point number $a$, with a rigorous error bound.


Lemma 3.14 Let $a, x > 0$, $\rho = a^{-1/2}$, and
$x' = x + (x/2)(1 - ax^2)$. Then

$$0 \le \rho - x' \le \frac{3x^3}{2\theta^4} (\rho - x)^2,$$

for some $\theta \in [\min(x, \rho), \max(x, \rho)]$.


Proof. The proof is very similar to that of Lemma 3.7. Here we use
$f(t) = a - 1/t^2$, with $\rho$ the root of $f$. Eqn. (3.5) translates to

$$\rho = x + \frac{x}{2}(1 - ax^2) + \frac{3x^3}{2\theta^4} (\rho - x)^2,$$

which proves the Lemma.
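
The iteration of Lemma 3.14 can be tried in exact rational arithmetic when $a$ is a perfect square of a rational, so that $\rho$ is rational too. The sketch uses the identity $\rho - x' = (\rho - x)^2 (2\rho + x)/(2\rho^2)$, obtained by substituting $a = 1/\rho^2$ and factoring; this identity is an observation added here, not a statement from the text, and the names and starting values are illustrative.

```python
from fractions import Fraction

def rec_sqrt_step(a, x):
    """One Newton step for a**(-1/2): x' = x + (x/2)(1 - a x^2)."""
    return x + x / 2 * (1 - a * x * x)

a = Fraction(9, 4)            # a = (3/2)^2, so rho = a**(-1/2) = 2/3 exactly
rho = Fraction(2, 3)
x = Fraction(3, 5)            # starting approximation below rho
errors = []
for _ in range(4):
    x_new = rec_sqrt_step(a, x)
    # Exact form of the error after one step; it is non-negative.
    assert rho - x_new == (rho - x) ** 2 * (2 * rho + x) / (2 * rho ** 2)
    errors.append(rho - x_new)
    x = x_new

print([float(err) for err in errors])
```

As for the reciprocal, the iterates approach $\rho$ from below and the error is roughly squared at each step.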


Algorithm 3.9 ApproximateRecSquareRoot
Input: integer $A$ with $\beta^n \le A < 4\beta^n$, $\beta \ge 38$
Output: integer $X$, $\beta^n/2 \le X < \beta^n$ satisfying Lemma 3.15
1: if $n \le 2$ then return $\min(\beta^n - 1, \lfloor \beta^n / \sqrt{A\beta^{-n}} \rfloor)$
2: $\ell \leftarrow \lfloor (n-1)/2 \rfloor$, $h \leftarrow n - \ell$
3: $A_h \leftarrow \lfloor A\beta^{-\ell} \rfloor$
4: $X_h \leftarrow$ ApproximateRecSquareRoot($A_h$)
5: $T \leftarrow A X_h^2$
6: $T_h \leftarrow \lfloor T\beta^{-n} \rfloor$
7: $T_{\ell} \leftarrow \beta^{2h} - T_h$
8: $U \leftarrow T_{\ell} X_h$
9: return $\min(\beta^n - 1, \lfloor X_h \beta^{\ell} + U\beta^{\ell-2h}/2 \rceil)$.


Lemma 3.15 Provided that $\beta \ge 38$, if $X$ is the value returned by
Algorithm ApproximateRecSquareRoot, $a = A\beta^{-n}$, $x = X\beta^{-n}$, then
$1/2 \le x < 1$ and

$$|x - a^{-1/2}| \le 2\beta^{-n}.$$

Proof. We have $1 \le a < 4$. Since $X$ is bounded by $\beta^n - 1$ at lines 1
and 9, we have $x, x_h < 1$, with $x_h = X_h \beta^{-h}$. We prove the
statement by induction on $n$. It is true for $n \le 2$. Now assume the value
$X_h$ at step 4 satisfies

$$|x_h - a_h^{-1/2}| \le 2\beta^{-h},$$

where $a_h = A_h \beta^{-h}$. We have three sources of error, that we will
bound separately:

1. the rounding errors in steps 6 and 9;

2. the mathematical error given by Lemma 3.14, which would occur even if
   all computations were exact;

3. the error coming from the fact we use $A_h$ instead of $A$ in the recursive
   call at step 4.


At step 5 we have exactly

$$t := T\beta^{-n-2h} = a x_h^2,$$

which gives $|t_h - a x_h^2| < \beta^{-2h}$ with $t_h := T_h \beta^{-2h}$, and
in turn $|t_{\ell} - (1 - a x_h^2)| < \beta^{-2h}$ with
$t_{\ell} := T_{\ell} \beta^{-2h}$. At step 8, it follows that
$|u - x_h(1 - a x_h^2)| < \beta^{-2h}$, where $u = U\beta^{-3h}$. Thus, after
taking into account the rounding error in the last step,
$|x - [x_h + x_h(1 - a x_h^2)/2]| < (\beta^{-2h} + \beta^{-n})/2$.

Now we apply Lemma 3.14 to $x \to x_h$, $x' \to x$, to bound the mathematical
error, assuming no rounding error occurs:

$$0 \le a^{-1/2} - x \le \frac{3x_h^3}{2\theta^4} (a^{-1/2} - x_h)^2,$$

which gives¹ $|a^{-1/2} - x| \le 3.04 (a^{-1/2} - x_h)^2$. Now
$|a^{-1/2} - a_h^{-1/2}| \le |a - a_h| \nu^{-3/2}/2$ for
$\nu \in [\min(a_h, a), \max(a_h, a)]$; thus,
$|a^{-1/2} - a_h^{-1/2}| \le \beta^{-h}/2$. Together with the induction
hypothesis $|x_h - a_h^{-1/2}| \le 2\beta^{-h}$, it follows that
$|a^{-1/2} - x_h| \le 2.5\beta^{-h}$. Thus,
$|a^{-1/2} - x| \le 19\beta^{-2h}$.


The total error is thus bounded by

$$|a^{-1/2} - x| \le \frac{3}{2}\beta^{-n} + 19\beta^{-2h}.$$

Since $2h \ge n + 1$, we see that $19\beta^{-2h} \le \beta^{-n}/2$ for
$\beta \ge 38$, and the lemma follows.

¹ Since $\theta \in [x_h, a^{-1/2}]$ and $|x_h - a^{-1/2}| \le 2.5\beta^{-h}$,
we have $\theta \ge x_h - 2.5\beta^{-h}$, and
$x_h/\theta \le 1 + 2.5\beta^{-h}/\theta \le 1 + 5\beta^{-h}$ (remember
$\theta \in [x_h, a^{-1/2}]$), and it follows that $\theta \ge 1/2$. For
$\beta \ge 38$, since $h \ge 2$, we have $1 + 5\beta^{-h} \le 1.0035$; thus,
$1.5 x_h^3/\theta^4 \le (1.5/\theta)(1.0035)^3 \le 3.04$.


NOTE: If $A_h X_h^2 \le \beta^{3h}$ at step 4 of Algorithm
ApproximateRecSquareRoot, we could have $A X_h^2 > \beta^{n+2h}$ at step 5,
which might cause $T_{\ell}$ to be negative.


Let $R(n)$ be the cost of ApproximateRecSquareRoot for an $n$-digit input. We
have $h, \ell \approx n/2$; thus, the recursive call costs $R(n/2)$, step 5
costs $M(n/2)$ to compute $X_h^2$, and $M(n)$ for the product $A X_h^2$ (or
$M(3n/4)$ in the FFT range using the wrap-around trick described in §3.4.1,
since we know the upper $n/2$ digits of the product give 1), and again
$M(n/2)$ for step 8. We get $R(n) = R(n/2) + 2M(n)$ (or $R(n/2) + 7M(n)/4$ in
the FFT range), which yields $R(n) \sim 4M(n)$ (or $R(n) \sim 3.5M(n)$ in the
FFT range).

This algorithm is not optimal in the FFT range, especially when using an
FFT algorithm with cheap point-wise products (such as the complex FFT, see
§3.3.1). Indeed, Algorithm ApproximateRecSquareRoot uses the following
form of Newton's iteration

$$x' = x + \frac{x}{2}(1 - ax^2).$$

It might be better to write

$$x' = x + \frac{1}{2}(x - ax^3).$$

Here, the product $x^3$ might be computed with a single FFT transform of
length $3n/2$, replacing the point-wise products $\hat{x}_i^2$ by
$\hat{x}_i^3$, with a total cost $\sim 0.75M(n)$. Moreover, the same idea can
be used for the full product $ax^3$ of $5n/2$ bits, where the upper $n/2$ bits
match those of $x$. Thus, using the wrap-around trick, a transform of length
$2n$ is enough, with a cost of $\sim M(n)$ for the last iteration, and a total
cost of $\sim 2M(n)$ for the reciprocal square root. With this improvement,
the algorithm of Exercise 3.15 costs only $\sim 2.25M(n)$.


3.6 Conversion 


Since most software tools work in radix 2 or $2^k$, and humans usually enter
or read floating-point numbers in radix 10 or $10^k$, conversions are needed
from one radix to the other one. Most applications perform very few
conversions, in comparison to other arithmetic operations, thus the efficiency
of the conversions is rarely critical.² The main issue here is therefore more
correctness than efficiency. Correctness of floating-point conversions is not
an easy task, as can be seen from the history of bugs in Microsoft Excel.³

The algorithms described in this section use as subroutines the integer- 
conversion algorithms from Chapter 1. As a consequence, their efficiency 
depends on the efficiency of the integer-conversion algorithms. 


3.6.1 Floating-point output 


In this section, we follow the convention of using lower-case letters for param- 
eters related to the internal radix b, and upper-case for parameters related to 
the external radix B. Consider the problem of printing a floating-point num- 
ber, represented internally in radix b (say b = 2) in an external radix B (say 
B = 10). We distinguish here two kinds of floating-point output: 


• Fixed-format output, where the output precision is given by the user, and
  we want the output value to be correctly rounded according to the given
  rounding mode. This is the usual method when values are to be used by
  humans, for example to fill a table of results. The input and output
  precisions may be very different, for example we may want to print 1000
  digits of 2/3, which uses only one digit internally in radix 3. Conversely,
  we may want to print only a few digits of a number accurate to 1000 bits.

• Free-format output, where we want the output value, when read with correct
  rounding (usually to nearest), to give exactly the initial number. Here the
  minimal number of printed digits may depend on the input number. This
  kind of output is useful when storing data in a file, while guaranteeing that
  reading the data back will produce exactly the same internal numbers, or for
  exchanging data between different programs.


In other words, if $x$ is the number that we want to print, and $X$ is the
printed value, the fixed-format output requires $|x - X| < \mathrm{ulp}(X)$,
and the free-format output requires $|x - X| < \mathrm{ulp}(x)$ for directed
rounding. Replace $< \mathrm{ulp}(\cdot)$ by $\le \mathrm{ulp}(\cdot)/2$ for
rounding to nearest.

Algorithm 3.10 PrintFixed
Input: $x = f \cdot b^{e-p}$ with $f, e, p$ integers, $b^{p-1} \le |f| < b^p$,
       external radix $B$ and precision $P$, rounding mode $\circ$
Output: $X = F \cdot B^{E-P}$ with $F, E$ integers, $B^{P-1} \le |F| < B^P$,
        such that $X = \circ(x)$ in radix $B$ and precision $P$
1: $\lambda \leftarrow \circ(\log b / \log B)$
2: $E \leftarrow 1 + \lfloor (e - 1)\lambda \rfloor$
3: $q \leftarrow \lceil P/\lambda \rceil$
4: $y \leftarrow \circ(x B^{P-E})$ with precision $q$
5: if we cannot round $y$ to an integer then increase $q$ and go to step 4
6: $F \leftarrow$ Integer($y$, $\circ$).        ▷ see §1.7
7: if $|F| \ge B^P$ then $E \leftarrow E + 1$ and go to step 4.
8: return $F$, $E$.


Some comments on Algorithm PrintFixed:

• It assumes that we have precomputed values of
  $\lambda_B = \circ(\log b / \log B)$ for any possible external radix $B$
  (the internal radix $b$ is assumed to be fixed for a given implementation).
  Assuming the input exponent $e$ is bounded, it is possible (see
  Exercise 3.17) to choose these values precisely enough that

  $$E = 1 + \left\lfloor (e - 1) \frac{\log b}{\log B} \right\rfloor. \qquad (3.8)$$

  Thus, the value of $\lambda$ at step 1 is simply read from a table.

² An important exception is the computation of billions of digits of constants
like $\pi$, $\log 2$, where a quadratic conversion routine would be far too
slow.

³ In Excel 2007, the product 850 × 77.1 prints as 100,000 instead of 65,535;
this is really an output bug, since if we multiply "100,000" by 2, we get
131,070. An input bug occurred in Excel 3.0 to 7.0, where the input
1.40737488355328 gave 0.64.


• The difficult part is step 4, where we have to perform the exponentiation
  $B^{P-E}$ (remember all computations are done in the internal radix $b$)
  and multiply the result by $x$. Since we expect an integer of $q$ digits in
  step 6, there is no need to use a precision of more than $q$ digits in these
  computations, but a rigorous bound on the rounding errors is required, so as
  to be able to correctly round $y$.

• In step 5, we can round $y$ to an integer if the interval containing all
  possible values of $x B^{P-E}$ (including the rounding errors while
  approaching $x B^{P-E}$, and the error while rounding to precision $q$)
  contains no rounding boundary (if $\circ$ is a directed rounding, it should
  contain no integer; if $\circ$ is rounding to nearest, it should contain no
  half-integer).
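
The overall structure of Algorithm PrintFixed can be sketched in Python for positive inputs and rounding to nearest. This sketch deliberately sidesteps the difficult part: instead of working with bounded precision $q$ and a rigorous error bound, it computes $x B^{P-E}$ exactly with `fractions.Fraction`, so steps 3 and 5 disappear. It also evaluates $\lambda$ with floating-point logarithms rather than reading it from a table. The function name and these simplifications are assumptions of the example.

```python
import math
from fractions import Fraction

def print_fixed(f, e, p, P, b=2, B=10):
    """For x = f * b**(e-p) > 0, return (F, E) with X = F * B**(E-P) the
    rounding to nearest of x to P digits in radix B."""
    x = Fraction(f) * Fraction(b) ** (e - p)
    lam = math.log(b) / math.log(B)        # step 1 (floating-point approximation)
    E = 1 + math.floor((e - 1) * lam)      # step 2, Eqn. (3.8)
    while True:
        y = x * Fraction(B) ** (P - E)     # step 4, computed exactly here
        F = round(y)                       # step 6: nearest integer, ties to even
        if F >= B ** P:                    # step 7: one digit too many
            E += 1
            continue
        return F, E

mant, expo = math.frexp(0.1)               # 0.1 = mant * 2**expo, 0.5 <= mant < 1
f = int(mant * 2 ** 53)                    # 53-bit integer significand
F, E = print_fixed(f, expo, 53, 17)
print(F, E)                                # digits and exponent: X = F * 10**(E - 17)
```

With exact rational arithmetic the result is correct by construction; a production implementation would instead follow steps 3 to 5 to keep the working precision close to $q$ digits.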


Theorem 3.16 Algorithm PrintFixed is correct.


Proof. First assume that the algorithm finishes. Eqn. (3.8) implies
$B^{E-1} \le b^{e-1}$; thus $|x| B^{P-E} \ge B^{P-1}$, which implies that
$|F| \ge B^{P-1}$ at step 6. Therefore $B^{P-1} \le |F| < B^P$ at the end of
the algorithm. Now, printing $x$ gives $F \cdot B^a$ iff printing $x B^k$
gives $F \cdot B^{a+k}$ for any integer $k$. Thus, it suffices to check that
printing $x B^{P-E}$ gives $F$, which is clear by construction.

The algorithm terminates because at step 4, $x B^{P-E}$, if not an integer,
can not be arbitrarily close to an integer. If $P - E \ge 0$, let $k$ be the
number of digits of $B^{P-E}$ in radix $b$, then $x B^{P-E}$ can be
represented exactly with $p + k$ digits. If $P - E < 0$, let $g = B^{E-P}$, of
$k$ digits in radix $b$. Assume $f/g = n + \varepsilon$ with $n$ integer; then
$f - gn = g\varepsilon$. If $\varepsilon$ is not zero, $g\varepsilon$ is a
non-zero integer, and $|\varepsilon| \ge 1/g \ge 2^{-k}$.

The case $|F| \ge B^P$ at step 7 can occur for two reasons: either
$|x| B^{P-E} \ge B^P$, and its rounding also satisfies this inequality; or
$|x| B^{P-E} < B^P$, but its rounding equals $B^P$ (this can only occur for
rounding away from zero or to nearest). In the former case, we have
$|x| B^{P-E} \ge B^{P-1}$ at the next pass in step 4, while in the latter case
the rounded value $F$ equals $B^{P-1}$ and the algorithm terminates.


Now consider free-format output. For a directed rounding mode, we want
$|x - X| < \mathrm{ulp}(x)$ knowing $|x - X| < \mathrm{ulp}(X)$. Similarly,
for rounding to nearest, if we replace ulp by ulp/2.

It is easy to see that a sufficient condition is that
$\mathrm{ulp}(X) \le \mathrm{ulp}(x)$, or equivalently
$B^{E-P} \le b^{e-p}$ in Algorithm PrintFixed (with $P$ not fixed at
input, which explains the "free-format" name). To summarize, we have

$$b^{e-1} \le |x| < b^e, \qquad B^{E-1} \le |X| < B^E.$$

Since $|x| < b^e$, and $X$ is the rounding of $x$, it suffices to have
$B^{E-1} \le b^e$. It follows that $B^{E-P} \le b^e B^{1-P}$, and the above
sufficient condition becomes

$$P \ge 1 + p \frac{\log b}{\log B}.$$

For example, with $b = 2$ and $B = 10$, $p = 53$ gives $P \ge 17$, and
$p = 24$ gives $P \ge 9$. As a consequence, if a double-precision IEEE 754
binary floating-point number is printed with at least 17 significant decimal
digits, it can be read back without any discrepancy, assuming input and output
are performed with correct rounding to nearest (or directed rounding, with
appropriately chosen directions).
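
The digit counts above are easy to reproduce, and the round-trip property can be observed on IEEE 754 doubles with Python's built-in formatting, which performs correctly rounded conversions in CPython. The function name `min_round_trip_digits` is an illustrative choice.

```python
import math

def min_round_trip_digits(p, b=2, B=10):
    """Smallest integer P with P >= 1 + p * log(b) / log(B)."""
    return math.ceil(1 + p * math.log(b) / math.log(B))

print(min_round_trip_digits(53), min_round_trip_digits(24))

x = 0.1 + 0.2                                   # a double that is not exactly 0.3
sixteen = float(format(x, ".16g"))              # 16 significant digits
seventeen = float(format(x, ".17g"))            # 17 significant digits
print(sixteen == x, seventeen == x)
```

Seventeen digits are sufficient for every double; for a particular number fewer digits may be enough, which is what free-format output exploits.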


3.6.2 Floating-point input 


The problem of floating-point input is the following. Given a floating-point
number $X$ with a significand of $P$ digits in some radix $B$ (say $B = 10$),
a precision $p$ and a given rounding mode, we want to correctly round $X$ to a
floating-point number $x$ with $p$ digits in the internal radix $b$ (say
$b = 2$).

At first glance, this problem looks very similar to the floating-point output
problem, and we might think it suffices to apply Algorithm PrintFixed, simply
exchanging $(b, p, e, f)$ and $(B, P, E, F)$. Unfortunately, this is not the
case. The difficulty is that, in Algorithm PrintFixed, all arithmetic
operations are performed in the internal radix $b$, and we do not have such
operations in radix $B$ (see however Exercise 1.37).


3.7 Exercises 


Exercise 3.1 In §3.1.5, we described a trick to get the next floating-point
number in the direction away from zero. Determine for which IEEE 754
double-precision numbers the trick works.


Exercise 3.2 (Kidder, Boldo) Assume a binary representation. The “rounding
to odd” mode [42, 148, ...] is defined as follows: in case the exact value is
not representable, it rounds to the unique adjacent number with an odd
significand. (“Von Neumann rounding” [42] omits the test for the exact value
being representable or not, and rounds to odd in all non-zero cases.) Note
that overflow never occurs during rounding to odd. Prove that if
$y = \mathrm{round}(x, p + k, \mathrm{odd})$ and
$z = \mathrm{round}(y, p, \mathrm{nearest\_even})$, and $k > 1$, then

$$z = \mathrm{round}(x, p, \mathrm{nearest\_even}),$$

i.e. the double-rounding problem does not occur.


Exercise 3.3 Show that, if $\sqrt{a}$ is computed using Newton's iteration for
$a^{-1/2}$:

$$x' = x + \frac{x}{2}(1 - ax^2)$$

(see §3.5.1), and the identity $\sqrt{a} = a \times a^{-1/2}$, with rounding
mode “round towards zero”, then it might never be possible to determine the
correctly rounded value of $\sqrt{a}$, regardless of the number of additional
guard digits used in the computation.


Exercise 3.4 How does truncating the operands of a multiplication to $n + g$
digits (as suggested in §3.3) affect the accuracy of the result? Considering
the cases $g = 1$ and $g > 1$ separately, what could happen if the same
strategy were used for subtraction?


Exercise 3.5 Is the bound of Theorem 3.5 optimal?


Exercise 3.6 Adapt Mulders’ short product algorithm [173] to floating-point
numbers. In case the first rounding fails, can you compute additional digits
without starting again from scratch?


Exercise 3.7 Show that, if a balanced ternary system is used (radix 3 with
digits $\{0, \pm 1\}$), then “round to nearest” is equivalent to truncation.


Exercise 3.8 (Percival) Suppose we compute the product of two complex
floating-point numbers $z_0 = a_0 + i b_0$ and $z_1 = a_1 + i b_1$ in the following
way: $x_a = \circ(a_0 a_1)$, $x_b = \circ(b_0 b_1)$, $y_a = \circ(a_0 b_1)$, $y_b = \circ(a_1 b_0)$,
$z = \circ(x_a - x_b) + i\,\circ(y_a + y_b)$. All computations are done in precision n, with
rounding to nearest. Compute an error bound of the form $|z - z_0 z_1| \le c\,2^{-n} |z_0 z_1|$.
What is the best possible constant c?


Exercise 3.9 Show that, if $\mu = O(\varepsilon)$ and $n\varepsilon < 1$, the bound in Theorem 3.6
simplifies to

\[ \|z' - z\|_\infty = O(|x| \cdot |y| \cdot n\varepsilon). \]

If the rounding errors cancel, we expect the error in each component of $z'$ to be
$O(|x| \cdot |y| \cdot n^{1/2}\varepsilon)$. The error $\|z' - z\|_\infty$ could be larger because it is a maximum
of $N = 2^n$ component errors. Using your favourite implementation of the
FFT, compare the worst-case error bound given by Theorem 3.6 with the error
$\|z' - z\|_\infty$ that occurs in practice.


Exercise 3.10 (Enge) Design an algorithm that correctly rounds the product 
of two complex floating-point numbers with 3 multiplications only. [Hint: as- 
sume all operands and the result have n-bit significand.] 


Exercise 3.11 Write a computer program to check that the entries of Table 3.3 are
correct and optimal, given Theorem 3.6.


Exercise 3.12 (Bodrato) Assuming we use an FFT modulo $\beta^n - 1$ in the
wrap-around trick, how should we modify step 5 of ApproximateReciprocal?


Exercise 3.13 To perform k divisions with the same divisor, which of Algo- 
rithm DivideNewton and Barrett’s algorithm is faster? 


Exercise 3.14 Adapt Algorithm FPSqrt to the rounding to nearest mode. 


Exercise 3.15 Devise an algorithm similar to Algorithm FPSqrt but using
Algorithm ApproximateRecSquareRoot to compute an n/2-bit approximation
to $x^{-1/2}$, and doing one Newton-like correction to return an n-bit approximation
to $x^{1/2}$. In the FFT range, your algorithm should take time ∼ 3M(n) (or
better).
Exercise 3.16 Prove that for any n-bit floating-point numbers $(x, y) \ne (0, 0)$,
and if all computations are correctly rounded, with the same rounding mode,
the result of $x/\sqrt{x^2 + y^2}$ lies in $[-1, 1]$, except in a special case. What is this
special case and for what rounding mode does it occur?


Exercise 3.17 Show that the computation of E in Algorithm PrintFixed,
step 2, is correct, i.e. $E = 1 + \lfloor (e - 1) \log b / \log B \rfloor$, as long as there is
no integer n such that $|n/(e - 1) \cdot \log B/\log b - 1| < \varepsilon$, where $\varepsilon$ is the relative
precision when computing $\lambda$: $\tilde{\lambda} = \log B/\log b\,(1 + \theta)$ with $|\theta| \le \varepsilon$. For a
fixed range of exponents $-e_{\max} \le e \le e_{\max}$, deduce a working precision $\varepsilon$.
Application: for b = 2, and $e_{\max} = 2^{31}$, compute the required precision for
$3 \le B \le 36$.


Exercise 3.18 (Lefèvre) The IEEE 754-1985 standard required binary to
decimal conversions to be correctly rounded in double precision in the range
$\{m \cdot 10^n : |m| \le 10^{17} - 1, |n| \le 27\}$. Find the hardest-to-print double-precision
number in this range (with rounding to nearest, for example). Write
a C program that outputs double-precision numbers in this range, and compare
it to the sprintf C-language function of your system; similarly, for a
conversion from the IEEE 754-2008 binary64 format (significand of 53 bits,
$2^{-1074} \le |x| < 2^{1024}$) to the decimal64 format (significand of 16 decimal
digits).


Exercise 3.19 The same question as in Exercise 3.18, but for decimal to binary 
conversion, and the atof C-language function. 


3.8 Notes and references


In her Ph.D. thesis [162, Chapter V], Valérie Ménissier-Morain discusses
continued fractions and redundant representations as alternatives to the classical
non-redundant representation considered here. She also considers [162,
Chapter III] the theory of computable reals, their representation by B-adic numbers,
and the computation of algebraic or transcendental functions.

Other representations were designed to increase the range of representable
values; in particular Clenshaw and Olver [70] invented level-index arithmetic,
where for example 2009 is approximated by 3.7075, since 2009 ≈
exp(exp(exp(0.7075))), and the leading 3 indicates the number of iterated
exponentials. The obvious drawback is that it is expensive to perform arithmetic
operations such as addition on numbers in the level-index representation.
Clenshaw and Olver [69] also introduced the concept of an unrestricted
algorithm (meaning no restrictions on the precision or exponent range).
Several such algorithms were described in Brent [48].

Nowadays most computers use radix two, but other choices (for example
radix 16) were popular in the past, before the widespread adoption of the IEEE
754 standard. A discussion of the best choice of radix is given in Brent [42].

For a general discussion of floating-point addition, rounding modes, the
sticky bit, etc., see Hennessy, Patterson, and Goldberg [120, Appendix A].⁴

The main reference for floating-point arithmetic is the IEEE 754 standard
[5], which defines four binary formats: single precision, single extended
(deprecated), double precision, and double extended. The IEEE 854 standard
defines radix-independent arithmetic, and mainly decimal arithmetic; see Cody
et al. [72]. Both standards were replaced by the revision of IEEE 754 (approved
by the IEEE Standards Committee on June 12, 2008).

We have not found the source of Theorem 3.2; it seems to be “folklore”.
The rule regarding the precision of a result, given possibly differing precisions
of the operands, was considered by Brent [49] and Hull [127].

Floating-point expansions were introduced by Priest [186]. They are mainly 
useful for a small number of summands, typically two or three, and when the 
main operations are additions or subtractions. For a larger number of sum- 
mands, the combinatorial logic becomes complex, even for addition. Also, 
except in simple cases, it seems difficult to obtain correct rounding with 
expansions. 

Some good references on error analysis of floating-point algorithms are the
books by Higham [121] and Muller [174]. Older references include
Wilkinson’s classics [228, 229].

Collins and Krandick [74], and Lefèvre [153], proposed algorithms for
multiple-precision floating-point addition.

The problem of leading zero anticipation and detection in hardware is
classical; see Schmookler and Nowka for a comparison of different methods.
Theorem 3.4 may be found in Sterbenz [210].

The idea of having a “short product” together with correct rounding was
studied by Krandick and Johnson [145]. They attributed the term “short
product” to Knuth. They considered both the schoolbook and the Karatsuba
domains. Algorithms ShortProduct and ShortDivision are due to Mulders [173].
The problem of consecutive zeros or ones (also called runs of zeros or ones)
has been studied by several authors in the context of computer arithmetic:
Iordache and Matula [129] studied division (Theorem 3.9), square root, and
reciprocal square root. Muller and Lang [151] generalized their results to
algebraic functions.

4 We refer to the first edition as later editions may not include the relevant Appendix by
Goldberg.

The fast Fourier transform (FFT) using complex floating-point numbers and
the Schönhage–Strassen algorithm are described in Knuth [142]. Many
variations of the FFT are discussed in the books by Crandall [79, 80]. For further
references, see §2.9.

Theorem 3.6 is from Percival [183]; previous rigorous error analyses of
complex FFT gave very pessimistic bounds. Note that the erroneous proof
given in [183] was corrected by Brent, Percival, and Zimmermann [55] (see
also Exercise 3.8).

The concept of “middle product” for power series is discussed in Hanrot et
al. [111]. Bostan, Lecerf, and Schost [40] have shown that it can be seen as
a special case of “Tellegen’s principle”, and have generalized it to operations
other than multiplication. The link between usual multiplication and the
middle product using trilinear forms was mentioned by Victor Pan [181] for the
multiplication of two complex numbers: “The duality technique enables us to
extend any successful bilinear algorithms to two new ones for the new
problems, sometimes quite different from the original problem ···”. Harvey [115]
has shown how to efficiently implement the middle product for integers. A
detailed and comprehensive description of the Payne and Hanek argument
reduction method can be found in Muller [174].

In this section, we drop the “∼” that strictly should be included in the
complexity bounds. The 2M(n) reciprocal algorithm of §3.4.1, with the
wrap-around trick, is due to Schönhage, Grotefeld, and Vetter [198]. It can be
improved, as noticed by Dan Bernstein [20]. If we keep the FFT-transform of
x, we can save M(n)/3 (assuming the term-to-term products have negligible
cost), which gives 5M(n)/3. Bernstein also proposes a “messy” 3M(n)/2
algorithm [20]. Schönhage’s 3M(n)/2 algorithm is simpler [197]. The idea
is to write Newton’s iteration as $x' = 2x - ax^2$. If x is accurate to n/2 bits,
then $ax^2$ has (in theory) 2n bits, but we know the upper n/2 bits cancel with
x, and we are not interested in the low n bits. Thus, we can perform modular
FFTs of size 3n/2, with cost M(3n/4) for the last iteration, and 1.5M(n)
overall. This 1.5M(n) bound for the reciprocal was improved to 1.444M(n)
by Harvey [116]. See also Cornea-Hasegan, Golliver, and Markstein [78] for
the roundoff error analysis when using a floating-point multiplier.

The idea of incorporating the dividend in Algorithm DivideNewton is due to
Karp and Markstein [137], and is usually known as the Karp–Markstein trick;
we already used it in Algorithm ExactDivision in Chapter 1. The asymptotic
complexity 5M(n)/2 of floating-point division can be improved to 5M(n)/3,
as shown by van der Hoeven in [125]. Another well-known method to perform
a floating-point division is Goldschmidt’s iteration: starting from a/b, first find
c such that $b_1 = cb$ is close to 1, and $a/b = a_1/b_1$ with $a_1 = ca$. At step
k, assuming $a/b = a_k/b_k$, we multiply both $a_k$ and $b_k$ by $2 - b_k$, giving
$a_{k+1}$ and $b_{k+1}$. The sequence $(b_k)$ converges to 1, and $(a_k)$ converges to a/b.
Goldschmidt’s iteration works because, if $b_k = 1 + \varepsilon_k$ with $\varepsilon_k$ small, then
$b_{k+1} = (1 + \varepsilon_k)(1 - \varepsilon_k) = 1 - \varepsilon_k^2$. Goldschmidt’s iteration admits quadratic
convergence as does Newton’s method. However, unlike Newton’s method,
Goldschmidt’s iteration is not self-correcting. Thus, it yields an arbitrary
precision division with cost O(M(n) log n). For this reason, Goldschmidt’s
iteration should only be used for small, fixed precision. A detailed analysis of
Goldschmidt’s algorithms for division and square root, and a comparison with
Newton’s method, is given in Markstein [158].
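
The recurrence for Goldschmidt’s iteration can be checked in exact rational
arithmetic. The following is only an illustrative sketch: the function name
`goldschmidt_divide` and the choice of the scaling factor `c` are assumptions
made for this example, and exact fractions hide the rounding issues that make
the iteration non-self-correcting in practice.

```python
from fractions import Fraction

def goldschmidt_divide(a, b, c, steps):
    """Return (a_k, b_k) after `steps` Goldschmidt steps.

    c is an initial scaling factor, assumed given, with c*b close to 1.
    """
    a_k, b_k = a * c, b * c          # a_1 = c*a, b_1 = c*b
    for _ in range(steps):
        factor = 2 - b_k             # multiply numerator and denominator by 2 - b_k
        a_k, b_k = a_k * factor, b_k * factor
    return a_k, b_k

a, b = Fraction(1), Fraction(3)
c = Fraction(3, 10)                  # c*b = 9/10, so the initial epsilon is -1/10
a_k, b_k = goldschmidt_divide(a, b, c, 3)
```

Each step squares the distance of $b_k$ from 1, while the quotient $a_k/b_k$
stays equal to a/b, so $a_k$ approaches a/b at the same rate.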

Bernstein [20] obtained faster square root algorithms in the FFT domain, by
caching some Fourier transforms. More precisely, he obtained 11M(n)/6 for
the square root, and 5M(n)/2 for the simultaneous computation of $x^{1/2}$ and
$x^{-1/2}$. The bound for the square root was reduced to 4M(n)/3 by
Harvey.

Classical floating-point conversion algorithms are due to Steele and White
[207], Gay [103], and Clinger [71]; most of these authors assume fixed
precision. Cowlishaw maintains an extensive bibliography of conversion to and
from decimal formats (see §5.3). What we call “free-format” output is called
“idempotent conversion” by Kahan [133]; see also Knuth [142, exercise
4.4-18]. Another useful reference on binary to decimal conversion is Cornea et
al. [77].

Bürgisser, Clausen, and Shokrollahi [60] is an excellent book on topics such
as lower bounds, fast multiplication of numbers and polynomials, Strassen-like
algorithms for matrix multiplication, and the tensor rank problem.

There is a large literature on interval arithmetic, which is outside the scope 
of this chapter. A recent book is Kulisch [149], and a good entry point is the 
Interval Computations web page (see Chapter 5). 

In this chapter, we did not consider complex arithmetic, except where
relevant for its use in the FFT. An algorithm for the complex (floating-point)
square root, which allows correct rounding, is given in Ercegovac and Muller
[91]. See also the comments on Friedland’s algorithm in §4.12.
Elementary and special function evaluation


Here we consider various applications of Newton’s method, 
which can be used to compute reciprocals, square roots, and more 
generally algebraic and functional inverse functions. We then 
consider unrestricted algorithms for computing elementary and
special functions. The algorithms of this chapter are presented at 
a higher level than in Chapter 3. A full and detailed analysis of 
one special function might be the subject of an entire chapter! 


4.1 Introduction 


This chapter is concerned with algorithms for computing elementary and
special functions, although the methods apply more generally. First we
consider Newton’s method, which is useful for computing inverse functions. For
example, if we have an algorithm for computing y = ln x, then Newton’s
method can be used to compute x = exp y (see §4.2.5). However, Newton’s
method has many other applications. In fact, we already mentioned Newton’s
method in Chapters 1–3, but here we consider it in more detail.

After considering Newton’s method, we go on to consider various methods
for computing elementary and special functions. These methods include power
series (§4.4), asymptotic expansions (§4.5), continued fractions (§4.6),
recurrence relations (§4.7), the arithmetic-geometric mean (§4.8), binary splitting
(§4.9), and contour integration (§4.10). The methods that we consider are
unrestricted in the sense that there is no restriction on the attainable precision:
in particular, it is not limited to the precision of IEEE standard 32-bit or 64-bit
floating-point arithmetic. Of course, this depends on the availability of a
suitable software package for performing floating-point arithmetic on operands of
arbitrary precision, as discussed in Chapter 3.
Unless stated explicitly, we do not consider rounding issues in this chapter;
it is assumed that methods described in Chapter 3 are used. Also, to simplify
the exposition, we assume a binary radix ($\beta = 2$), although most of the content
could be extended to any radix. We recall that n denotes the relative precision
(in bits here) of the desired approximation; if the absolute computed value is
close to 1, then we want an approximation to within $2^{-n}$.


4.2 Newton’s method 


Newton’s method is a major tool in arbitrary-precision arithmetic. We have
already seen it or its p-adic counterpart, namely Hensel lifting, in previous
chapters (see for example Algorithm ExactDivision in §1.4.5, or the iteration (2.3)
to compute a modular inverse in §2.5). Newton’s method is also useful in small
precision: most modern processors only implement addition and multiplication
in hardware; division and square root are microcoded, using either Newton’s
method if a fused multiply-add instruction is available, or the SRT algorithm.
See the algorithms to compute a floating-point reciprocal or reciprocal square
root in §3.4.1 and §3.5.1.

This section discusses Newton’s method in more detail, in the context of
floating-point computations, for the computation of inverse roots (§4.2.1),
reciprocals (§4.2.2), reciprocal square roots (§4.2.3), formal power series
(§4.2.4), and functional inverses (§4.2.5). We also discuss higher-order
Newton-like methods (§4.2.6).


Newton’s method via linearization 
Recall that a function f of a real variable is said to have a zero $\zeta$ if $f(\zeta) = 0$.
If f is differentiable in a neighborhood of $\zeta$, and $f'(\zeta) \ne 0$, then $\zeta$ is said to be
a simple zero. Similarly, for functions of several real (or complex) variables. In
the case of several variables, $\zeta$ is a simple zero if the Jacobian matrix evaluated
at $\zeta$ is non-singular.

Newton’s method for approximating a simple zero $\zeta$ of f is based on the idea
of making successive linear approximations to f(x) in a neighborhood of $\zeta$.
Suppose that $x_0$ is an initial approximation, and that f(x) has two continuous
derivatives in the region of interest. From Taylor’s theorem,¹

\[ f(\zeta) = f(x_0) + (\zeta - x_0) f'(x_0) + \frac{(\zeta - x_0)^2}{2} f''(\xi) \tag{4.1} \]

for some point $\xi$ in an interval including $\{\zeta, x_0\}$. Since $f(\zeta) = 0$, we see that

\[ x_1 = x_0 - f(x_0)/f'(x_0) \]

is an approximation to $\zeta$, and

\[ x_1 - \zeta = O(|x_0 - \zeta|^2). \]

Provided $x_0$ is sufficiently close to $\zeta$, we will have

\[ |x_1 - \zeta| \le |x_0 - \zeta|/2 < 1. \]

1 Here we use Taylor’s theorem at $x_0$, since this yields a formula in terms of derivatives at $x_0$,
which is known, instead of at $\zeta$, which is unknown. Sometimes (for example in the derivation
of (4.3)), it is preferable to use Taylor’s theorem at the (unknown) zero $\zeta$.


This motivates the definition of Newton’s method as the iteration

\[ x_{j+1} = x_j - \frac{f(x_j)}{f'(x_j)}, \quad j = 0, 1, \ldots \tag{4.2} \]

Provided $|x_0 - \zeta|$ is sufficiently small, we expect $x_n$ to converge to $\zeta$. The
order of convergence will be at least two, i.e.

\[ |e_{n+1}| \le K |e_n|^2 \]

for some constant K independent of n, where $e_n = x_n - \zeta$ is the error after n
iterations.

A more careful analysis shows that

\[ e_{n+1} = \frac{f''(\zeta)}{2 f'(\zeta)}\, e_n^2 + O(|e_n|^3), \tag{4.3} \]

provided $f \in C^3$ near $\zeta$. Thus, the order of convergence is exactly two if
$f''(\zeta) \ne 0$ and $e_0$ is sufficiently small but non-zero. (Such an iteration is also
said to be quadratically convergent.)
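
The error relation (4.3) can be observed numerically. The sketch below is an
illustration only: it applies iteration (4.2) in ordinary double precision to
the assumed test function $f(x) = x^2 - 2$ (an arbitrary choice for this
example), and the helper name `newton` is hypothetical.

```python
import math

def newton(f, fprime, x0, steps):
    """Return the list of iterates x_0, x_1, ..., x_steps of iteration (4.2)."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - f(x) / fprime(x))
    return xs

# Assumed test problem: the simple zero zeta = sqrt(2) of f(x) = x^2 - 2.
xs = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.5, 3)
zeta = math.sqrt(2.0)
errs = [x - zeta for x in xs]

# According to (4.3), e_{n+1}/e_n^2 approaches f''(zeta)/(2 f'(zeta)) = 1/(2 sqrt(2)).
ratio = errs[2] / errs[1] ** 2
limit = 1.0 / (2.0 * math.sqrt(2.0))
```

The number of correct digits roughly doubles at each step until the working
precision of the floating-point format is reached.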


4.2.1 Newton’s method for inverse roots 
Consider applying Newton’s method to the function

\[ f(x) = y - x^{-m}, \]

where m is a positive integer constant, and (for the moment) y is a positive
constant. Since $f'(x) = m x^{-(m+1)}$, Newton’s iteration simplifies to

\[ x_{j+1} = x_j + x_j (1 - x_j^m y)/m. \tag{4.4} \]

This iteration converges to $\zeta = y^{-1/m}$ provided the initial approximation $x_0$
is sufficiently close to $\zeta$. It is perhaps surprising that (4.4) does not involve
divisions, except for a division by the integer constant m. In particular, we can
easily compute reciprocals (the case m = 1) and reciprocal square roots (the
case m = 2) by Newton’s method. These cases are sufficiently important that
we discuss them separately in the following subsections.
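
As a small illustration of (4.4), the following sketch iterates in double
precision rather than in arbitrary precision. The function name
`inverse_root`, the starting value, and the number of steps are assumptions
chosen for the example, not part of the text.

```python
def inverse_root(y, m, x0, steps):
    """Approximate y**(-1/m) by iteration (4.4); the only division is by m."""
    x = x0
    for _ in range(steps):
        x = x + x * (1.0 - x ** m * y) / m
    return x

# Assumed example: 2**(-1/3) is about 0.7937, so 0.8 is a close starting value.
r = inverse_root(2.0, 3, 0.8, 6)
```

A starting value far from $y^{-1/m}$ may fail to converge, which is why the
text requires $x_0$ to be sufficiently close to $\zeta$.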
4.2.2 Newton’s method for reciprocals


Taking m = 1 in (4.4), we obtain the iteration

\[ x_{j+1} = x_j + x_j (1 - x_j y), \tag{4.5} \]

which we expect to converge to 1/y, provided $x_0$ is a sufficiently good
approximation. (See §3.4.1 for a concrete algorithm with error analysis.) To see what
“sufficiently good” means, define

\[ u_j = 1 - x_j y. \]

Note that $u_j \to 0$ iff $x_j \to 1/y$. Multiplying each side of (4.5) by y, we get

\[ 1 - u_{j+1} = (1 - u_j)(1 + u_j), \]

which simplifies to

\[ u_{j+1} = u_j^2. \tag{4.6} \]

Thus

\[ u_j = (u_0)^{2^j}. \tag{4.7} \]

We see that the iteration converges iff $|u_0| < 1$, which (for real $x_0$ and y) is
equivalent to the condition $x_0 y \in (0, 2)$. Second-order convergence is reflected
in the double exponential with exponent $2^j$ on the right-hand side of (4.7).
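
Relation (4.7) holds exactly when the iteration is carried out without
rounding, which can be checked with rational arithmetic. This is only a
sketch: the name `reciprocal_iterates` and the sample values of y and $x_0$
are assumptions for illustration.

```python
from fractions import Fraction

def reciprocal_iterates(y, x0, steps):
    """Return x_0, ..., x_steps for iteration (4.5), computed exactly."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x + x * (1 - x * y))
    return xs

y = Fraction(3)
x0 = Fraction(1, 4)                  # x0*y = 3/4 lies in (0, 2), so u_0 = 1/4
xs = reciprocal_iterates(y, x0, 4)
u = [1 - x * y for x in xs]          # u_j = 1 - x_j*y
```

Here $u_4 = (1/4)^{16}$, so four iterations already give about 32 correct bits
of 1/3.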

The iteration (4.5) is sometimes implemented in hardware to compute
reciprocals of floating-point numbers (see §4.12). The sign and exponent of the
floating-point number are easily handled, so we can assume that $y \in [0.5, 1.0)$
(recall we assume a binary radix in this chapter). The initial approximation $x_0$
is found by table lookup, where the table is indexed by the first few bits of y.
Since the order of convergence is two, the number of correct bits approximately
doubles at each iteration. Thus, we can predict in advance how many iterations
are required. Of course, this assumes that the table is initialized correctly.²

Computational issues
At first glance, it seems better to replace Eqn. (4.5) by

\[ x_{j+1} = x_j (2 - x_j y), \tag{4.8} \]

which looks simpler. However, although those two forms are mathematically
equivalent, they are not computationally equivalent. Indeed, in Eqn. (4.5), if
$x_j$ approximates 1/y to within n/2 bits, then $1 - x_j y = O(2^{-n/2})$, and the
product of $x_j$ by $1 - x_j y$ might be computed with a precision of only n/2 bits.
In the apparently simpler form (4.8), $2 - x_j y = 1 + O(2^{-n/2})$, and the product
of $x_j$ by $2 - x_j y$ has to be performed with a full precision of n bits to get $x_{j+1}$
accurate to within n bits.

2 In the case of the infamous Pentium fdiv bug [109, 175], a lookup table used for division
was initialized incorrectly, and the division was occasionally inaccurate. In this case division
used the SRT algorithm, but the moral is the same: tables must be initialized correctly.

As a general rule, it is best to separate the terms of different order in
Newton’s iteration, and not try to factor common expressions. For an exception, see
the discussion of Schönhage’s 3M(n)/2 reciprocal algorithm in §3.8.
4.2.3 Newton’s method for (reciprocal) square roots
Taking m = 2 in (4.4), we obtain the iteration

\[ x_{j+1} = x_j + x_j (1 - x_j^2 y)/2, \tag{4.9} \]

which we expect to converge to $y^{-1/2}$ provided $x_0$ is a sufficiently good
approximation.

If we want to compute $y^{1/2}$, we can do this in one multiplication after first
computing $y^{-1/2}$, since

\[ y^{1/2} = y \times y^{-1/2}. \]

This method does not involve any divisions (except by 2, see Exercise 3.15).
In contrast, if we apply Newton’s method to the function $f(x) = x^2 - y$, we
obtain Heron’s³ iteration (see Algorithm SqrtInt in §1.5.1) for the square root
of y

\[ x_{j+1} = \frac{1}{2} \left( x_j + \frac{y}{x_j} \right). \tag{4.10} \]

This requires a division by $x_j$ at iteration j, so it is essentially different from
the iteration (4.9). Although both iterations have second-order convergence,
we expect (4.9) to be more efficient (however this depends on the relative cost
of division compared to multiplication). See also §3.5.1 and, for various
optimizations, §3.8.
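
The two routes to a square root can be compared side by side. The sketch
below works in double precision only; the function names and starting values
are assumptions for illustration, and it says nothing about the relative
costs in arbitrary precision.

```python
import math

def sqrt_via_rsqrt(y, x0, steps):
    """Iterate (4.9) towards y**(-1/2), then multiply once by y."""
    x = x0
    for _ in range(steps):
        x = x + x * (1.0 - x * x * y) / 2.0   # no division except by 2
    return y * x

def sqrt_heron(y, x0, steps):
    """Heron's iteration (4.10); needs a division by x at every step."""
    x = x0
    for _ in range(steps):
        x = 0.5 * (x + y / x)
    return x

s1 = sqrt_via_rsqrt(2.0, 0.7, 6)    # 0.7 is close to 2**(-1/2)
s2 = sqrt_heron(2.0, 1.4, 6)        # 1.4 is close to 2**(1/2)
```

Both results agree with `math.sqrt(2.0)` to within a few units in the last
place; the difference between the methods lies in the operations used per
step.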


4.2.4 Newton’s method for formal power series 


This section is not required for function evaluation, however it gives a
complementary point of view on Newton’s method, and has applications to computing
constants such as Bernoulli numbers (see Exercises 4.41–4.42).

3 Heron of Alexandria, circa 10–75 AD.

Newton’s method can be applied to find roots of functions defined by
formal power series as well as of functions of a real or complex variable. For
simplicity, we consider formal power series of the form

\[ A(z) = a_0 + a_1 z + a_2 z^2 + \cdots, \]

where $a_i \in \mathbb{R}$ (or any field of characteristic zero) and ord(A) = 0, i.e. $a_0 \ne 0$.

For example, if we replace y in (4.5) by 1 − z, and take initial approximation
$x_0 = 1$, we obtain a quadratically convergent iteration for the formal power
series

\[ (1 - z)^{-1} = \sum_{n=0}^{\infty} z^n. \]

In the case of formal power series, “quadratically convergent” means that
$\mathrm{ord}(e_j) \to +\infty$ like $2^j$, where $e_j$ is the difference between the desired
result and the jth approximation. In our example, with the notation of §4.2.2,
$u_0 = 1 - x_0 y = z$, so $u_j = z^{2^j}$ and

\[ x_j = \frac{1 - u_j}{1 - z} = \frac{1}{1 - z} + O\!\left(z^{2^j}\right). \]

Given a formal power series $A(z) = \sum_{j \ge 0} a_j z^j$, we can define the formal
derivative

\[ A'(z) = \sum_{j > 0} j a_j z^{j-1} = a_1 + 2 a_2 z + 3 a_3 z^2 + \cdots, \]

and the integral

\[ \sum_{j \ge 0} \frac{a_j}{j + 1} z^{j+1}, \]

but there is no useful analogue for multiple-precision integers $\sum_{j=0}^{n} a_j \beta^j$.
This means that some fast algorithms for operations on power series have no
analogue for operations on integers (see for example Exercise 4.1).


4.2.5 Newton’s method for functional inverses 


Given a function g(x), its functional inverse h(x) satisfies g(h(x)) = x, and
is denoted by $h(x) := g^{(-1)}(x)$. For example, g(x) = ln x and h(x) = exp x
are functional inverses, as are g(x) = tan x and h(x) = arctan x. Using the
function f(x) = y − g(x) in (4.2), we get a root $\zeta$ of f, i.e. a value such that
$g(\zeta) = y$, or $\zeta = g^{(-1)}(y)$:

\[ x_{j+1} = x_j + \frac{y - g(x_j)}{g'(x_j)}. \]

Since this iteration only involves g and g′, it provides an efficient way to
evaluate h(y), assuming that $g(x_j)$ and $g'(x_j)$ can be efficiently computed.
Moreover, if the complexity of evaluating g′ (and of division) is no greater than
that of g, we get a means to evaluate the functional inverse h of g with the same
order of complexity as that of g.

As an example, if one has an efficient implementation of the logarithm, a
similarly efficient implementation of the exponential is deduced as follows.
Consider the root $e^y$ of the function f(x) = y − ln x, which yields the iteration

\[ x_{j+1} = x_j + x_j (y - \ln x_j), \tag{4.11} \]

and in turn Algorithm LiftExp (for the sake of simplicity, we consider here
only one Newton iteration).


Algorithm 4.1 LiftExp
Input: $x_j$, (n/2)-bit approximation to exp(y)
Output: $x_{j+1}$, n-bit approximation to exp(y)

t ← ln $x_j$          ▷ t computed to n-bit accuracy
u ← y − t             ▷ u computed to (n/2)-bit accuracy
v ← $x_j$ u           ▷ v computed to (n/2)-bit accuracy
$x_{j+1}$ ← $x_j$ + v.
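
A rough decimal analogue of Algorithm LiftExp can be written with Python’s
`decimal` module, whose `ln` method plays the role of the assumed logarithm
routine. This is only a sketch: the function name `lift_exp` is hypothetical,
and for simplicity every step uses the same working precision instead of the
mixed n-bit and (n/2)-bit accuracies of the algorithm.

```python
from decimal import Decimal, localcontext

def lift_exp(x_j, y, digits):
    """One Newton step (4.11) for exp(y), using `digits` decimal digits throughout."""
    with localcontext() as ctx:
        ctx.prec = digits
        t = x_j.ln()       # t <- ln x_j
        u = y - t          # u <- y - t
        v = x_j * u        # v <- x_j * u
        return x_j + v     # x_{j+1} <- x_j + v

y = Decimal(1)
x0 = Decimal("2.718")      # about 4 correct digits of e
x1 = lift_exp(x0, y, 30)   # about 8 correct digits
x2 = lift_exp(x1, y, 30)   # about 16 correct digits

with localcontext() as ctx:
    ctx.prec = 30
    e_ref = y.exp()        # reference value for comparison
```

Each call roughly doubles the number of correct digits, as expected from the
second-order convergence of Newton’s method.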


4.2.6 Higher-order Newton-like methods 


The classical Newton’s method is based on a linear approximation of f(x) near
$x_0$. If we use a higher-order approximation, we can get a higher-order method.
Consider for example a second-order approximation. Eqn. (4.1) becomes:

\[ f(\zeta) = f(x_0) + (\zeta - x_0) f'(x_0) + \frac{(\zeta - x_0)^2}{2} f''(x_0) + \frac{(\zeta - x_0)^3}{6} f'''(\xi). \]

Since $f(\zeta) = 0$, we have

\[ \zeta = x_0 - \frac{f(x_0)}{f'(x_0)} - \frac{(\zeta - x_0)^2}{2} \frac{f''(x_0)}{f'(x_0)} + O\!\left((\zeta - x_0)^3\right). \tag{4.12} \]

A difficulty here is that the right-hand side of (4.12) involves the unknown $\zeta$.
Let $\zeta = x_0 - f(x_0)/f'(x_0) + \nu$, where $\nu$ is a second-order term. Substituting
this in the right-hand side of (4.12) and neglecting terms of order $(\zeta - x_0)^3$
yields the cubic iteration

\[ x_{j+1} = x_j - \frac{f(x_j)}{f'(x_j)} - \frac{f(x_j)^2 f''(x_j)}{2 f'(x_j)^3}. \]
For the computation of the reciprocal (§4.2.2) with f(x) = y − 1/x, this yields

\[ x_{j+1} = x_j + x_j (1 - x_j y) + x_j (1 - x_j y)^2. \tag{4.13} \]

For the computation of exp y using functional inversion (§4.2.5), we get

\[ x_{j+1} = x_j + x_j (y - \ln x_j) + \frac{1}{2} x_j (y - \ln x_j)^2. \tag{4.14} \]


These iterations can be obtained in a more systematic way that generalizes to
give iterations of arbitrarily high order. For the computation of the reciprocal,
let $\varepsilon_j = 1 - x_j y$, so $x_j y = 1 - \varepsilon_j$ and (assuming $|\varepsilon_j| < 1$),

\[ 1/y = x_j/(1 - \varepsilon_j) = x_j (1 + \varepsilon_j + \varepsilon_j^2 + \cdots). \]

Truncating after the term $\varepsilon_j^{k-1}$ gives a kth-order iteration

\[ x_{j+1} = x_j (1 + \varepsilon_j + \varepsilon_j^2 + \cdots + \varepsilon_j^{k-1}) \tag{4.15} \]

for the reciprocal. The case k = 2 corresponds to Newton’s method, and the
case k = 3 is just the iteration (4.13) that we derived above.

Similarly, for the exponential we take $\varepsilon_j = y - \ln x_j = \ln(x/x_j)$, where
$x = \exp y$, so

\[ x/x_j = \exp \varepsilon_j = \sum_{m=0}^{\infty} \frac{\varepsilon_j^m}{m!}. \]

Truncating after k terms gives a kth-order iteration

\[ x_{j+1} = x_j \left( \sum_{m=0}^{k-1} \frac{\varepsilon_j^m}{m!} \right) \tag{4.16} \]

for the exponential function. The case k = 2 corresponds to the Newton
iteration, the case k = 3 is the iteration (4.14) that we derived above, and the cases
k > 3 give higher-order Newton-like iterations. For a generalization to other
functions, see Exercises 4.3, 4.6.
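
For the reciprocal, the order of iteration (4.15) can be verified exactly:
multiplying (4.15) by y gives $x_{j+1} y = (1 - \varepsilon_j)(1 + \varepsilon_j + \cdots + \varepsilon_j^{k-1}) = 1 - \varepsilon_j^k$,
so $\varepsilon_{j+1} = \varepsilon_j^k$. The sketch below checks this with rational
arithmetic; the function name and sample values are assumptions for
illustration.

```python
from fractions import Fraction

def reciprocal_step(x, y, k):
    """One step of the kth-order iteration (4.15) for 1/y."""
    eps = 1 - x * y
    return x * sum(eps ** m for m in range(k))   # x*(1 + eps + ... + eps^(k-1))

y = Fraction(7)
x0 = Fraction(1, 8)
eps0 = 1 - x0 * y                                # eps_0 = 1/8

# New residual 1 - x_1*y for orders k = 2, 3, 4.
residuals = {k: 1 - reciprocal_step(x0, y, k) * y for k in (2, 3, 4)}
```

With k = 2 this reproduces the squaring relation (4.6); larger k trades more
multiplications per step for fewer steps.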


4.3 Argument reduction 


Argument reduction is a classical method to improve the efficiency of the eval- 
uation of mathematical functions. The key idea is to reduce the initial problem 
to a domain where the function is easier to evaluate. More precisely, given f 
to evaluate at x, we proceed in three steps: 
• argument reduction: x is transformed into a reduced argument x′;
• evaluation: f is evaluated at x′;
• reconstruction: f(x) is computed from f(x′) using a functional identity.


In some cases, the argument reduction or the reconstruction is trivial, for
example x′ = x/2 in radix 2, or f(x) = ±f(x′) (some examples illustrate this
below). It might also be that the evaluation step uses a different function g
instead of f; for example, sin(x + π/2) = cos(x).

Unfortunately, argument reduction formulae do not exist for every function;
for example, no argument reduction is known for the error function. Argument
reduction is only possible when a functional identity relates f(x) and f(x′) (or
g(x) and g(x′)). The elementary functions have addition formulae such as

\[
\begin{aligned}
\exp(x + y) &= \exp(x) \exp(y), \\
\log(xy) &= \log(x) + \log(y), \\
\sin(x + y) &= \sin(x) \cos(y) + \cos(x) \sin(y), \\
\tan(x + y) &= \frac{\tan(x) + \tan(y)}{1 - \tan(x) \tan(y)}.
\end{aligned}
\tag{4.17}
\]

We use these formulae to reduce the argument so that power series converge
more rapidly. Usually we take x = y to get doubling formulae such as

\[ \exp(2x) = \exp(x)^2, \tag{4.18} \]

though occasionally tripling formulae such as

\[ \sin(3x) = 3 \sin(x) - 4 \sin^3(x) \]

might be useful. This tripling formula only involves one function (sin), whereas
the doubling formula sin(2x) = 2 sin x cos x involves two functions (sin and
cos), but this problem can be overcome: see §4.3.4 and §4.9.1.

We usually distinguish two kinds of argument reduction: 


• Additive argument reduction, where x′ = x − kc, for some real constant
c and some integer k. This occurs in particular when f(x) is periodic, for
example for the sine and cosine functions with c = 2π.

• Multiplicative argument reduction, where $x' = x/c^k$ for some real constant
c and some integer k. This occurs with c = 2 in the computation of exp x
when using the doubling formula (4.18): see §4.3.1.


Note that, for a given function, both kinds of argument reduction might be
available. For example, for sin x, we might either use the tripling formula
$\sin(3x) = 3 \sin x - 4 \sin^3 x$, or the additive reduction sin(x + 2kπ) = sin x
that arises from the periodicity of sin.
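
The three steps (reduction, evaluation, reconstruction) can be sketched for
sin x with the tripling formula, i.e. multiplicative reduction with c = 3.
This is a double-precision illustration only; the function name
`sin_tripling`, the number of Taylor terms, and the choice k = 4 are
assumptions for the example.

```python
import math

def sin_tripling(x, k, terms=8):
    """Evaluate sin(x) via sin(x / 3**k) and k uses of the tripling formula."""
    t = x / 3 ** k                        # argument reduction: x' = x / 3^k
    s, term = 0.0, t
    for i in range(terms):                # evaluation: Taylor series of sin at x'
        s += term
        term *= -t * t / ((2 * i + 2) * (2 * i + 3))
    for _ in range(k):                    # reconstruction: sin(3u) = 3 sin u - 4 sin^3 u
        s = 3.0 * s - 4.0 * s ** 3
    return s

val = sin_tripling(1.0, 4)
```

Because the reduced argument is small, a handful of series terms suffices;
the price is the k reconstruction steps, each of which can amplify earlier
rounding errors.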
Sometimes “reduction” is not quite the right word, since a functional identity
is used to increase rather than to decrease the argument. For example, the
Gamma function Γ(x) satisfies an identity

\[ x\,\Gamma(x) = \Gamma(x + 1), \]

that can be used repeatedly to increase the argument until we reach the region
where Stirling’s asymptotic expansion is sufficiently accurate, see §4.5.


4.3.1 Repeated use of a doubling formula 


If we apply the doubling formula (4.18) for the exponential function k times,
we get

\[ \exp(x) = \exp(x/2^k)^{2^k}. \]

Thus, if |x| = O(1), we can reduce the problem of evaluating exp(x) to that of
evaluating $\exp(x/2^k)$, where the argument is now $O(2^{-k})$. This is better since
the power series converges more quickly for $x/2^k$. The cost is the k squarings
that we need to reconstruct the final result from $\exp(x/2^k)$.

There is a trade-off here, and k should be chosen to minimize the total time.
If the obvious method for power series evaluation is used, then the optimal k is
of order $\sqrt{n}$ and the overall time is $O(n^{1/2} M(n))$. We shall see in §4.4.3 that
there are faster ways to evaluate power series, so this is not the best possible
result.

We assumed here that |x| = O(1). A more careful analysis shows that the
optimal k depends on the order of magnitude of x (see Exercise 4.5).
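
The trade-off between series terms and squarings can be tried out directly.
The following double-precision sketch is an illustration under assumed
parameters (the function name `exp_by_doubling`, k = 8 and 6 series terms are
choices made for the example, not optimal values).

```python
import math

def exp_by_doubling(x, k, terms):
    """Sum `terms` terms of the series for exp(x / 2**k), then square k times."""
    r = x / 2.0 ** k
    s, term = 0.0, 1.0
    for m in range(terms):
        s += term                 # adds r^m / m!
        term *= r / (m + 1)
    for _ in range(k):            # reconstruction by repeated squaring
        s *= s
    return s

reduced = exp_by_doubling(1.0, 8, 6)    # argument 1/256, 6 terms, 8 squarings
unreduced = exp_by_doubling(1.0, 0, 6)  # same 6 terms, no reduction
```

With the reduced argument, six terms already give a result close to the
working precision, whereas the same six terms without reduction leave an
error above $10^{-3}$.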


4.3.2 Loss of precision 


For some power series, especially those with alternating signs, a loss of
precision might occur due to a cancellation between successive terms. A typical
example is the series for exp(x) when x < 0. Assume for example that we
want ten significant digits of exp(−10). The first ten terms $x^k/k!$ for x = −10
are approximately:


1., −10., 50., −166.6666667, 416.6666667, −833.3333333, 1388.888889,
−1984.126984, 2480.158730, −2755.731922.


Note that these terms alternate in sign and initially increase in magnitude. They
only start to decrease in magnitude for k > |x|. If we add the first 51 terms
with a working precision of ten decimal digits, we get an approximation to
exp(−10) that is only accurate to about three digits!
A much better approach is to use the identity

\[ \exp(x) = 1/\exp(-x) \]

to avoid cancellation in the power series summation. In other cases, a different
power series without sign changes might exist for a closely related function:
for example, compare the series (4.22) and (4.23) for computation of the error
function erf(x). See also Exercises 4.19–4.20.
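
The exp(−10) experiment is easy to repeat with Python’s `decimal` module set
to ten digits. This is a sketch under the assumptions stated in the comments;
the function name `exp_series` is hypothetical, and the exact digits obtained
depend on the rounding mode of the decimal context.

```python
from decimal import Decimal, localcontext

def exp_series(x, terms, digits):
    """Sum the first `terms` terms of the exp series using `digits` decimal digits."""
    with localcontext() as ctx:
        ctx.prec = digits
        s, term = Decimal(0), Decimal(1)
        for k in range(terms):
            s += term                     # adds x^k / k!
            term = term * x / (k + 1)
        return s

x = Decimal(-10)
direct = exp_series(x, 51, 10)            # alternating series, heavy cancellation
with localcontext() as ctx:
    ctx.prec = 10
    via_reciprocal = 1 / exp_series(-x, 51, 10)   # all terms positive
with localcontext() as ctx:
    ctx.prec = 30
    exact = x.exp()                       # reference value

rel_direct = abs((direct - exact) / exact)
rel_recip = abs((via_reciprocal - exact) / exact)
```

Summing the positive series and taking the reciprocal keeps nearly all ten
digits, while the direct alternating sum loses most of them.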


4.3.3 Guard digits 


Guard digits are digits in excess of the number of digits that are required in 
the final answer. Generally, it is necessary to use some guard digits during a 
computation in order to obtain an accurate result (one that is correctly rounded 
or differs from the correctly rounded result by a small number of units in the 
last place). Of course, it is expensive to use too many guard digits. Thus, care 
has to be taken to use the right number of guard digits, i.e. the right working 
precision. Here and below, we use the generic term “guard digits”, even for 
radix β = 2.

Consider once again the example of exp x, with reduced argument $x/2^k$ and
x = O(1). Since $x/2^k$ is $O(2^{-k})$, when we sum the power series $1 + x/2^k + \cdots$
from left to right (forward summation), we "lose" about k bits of precision.
More precisely, if $x/2^k$ is accurate to n bits, then $1 + x/2^k$ is accurate to n + k
bits, but if we use the same working precision n, we obtain only n correct bits.
After squaring k times in the reconstruction step, about k bits will be lost (each
squaring loses about one bit), so the final accuracy will be only n − k bits. If
we summed the power series in reverse order instead (backward summation),
and used a working precision of n + k when adding 1 and $x/2^k + \cdots$ and
during the squarings, we would obtain an accuracy of n + k bits before the k
squarings, and an accuracy of n bits in the final result.

Another way to avoid loss of precision is to evaluate $\mathrm{expm1}(x/2^k)$, where
the function expm1 is defined by

$$\mathrm{expm1}(x) = \exp(x) - 1$$

and has a doubling formula that avoids loss of significance when |x| is small.
See Exercises 4.7-4.9.
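As an illustration, one such doubling formula follows from $e^{2x} - 1 = (e^x - 1)(e^x + 1)$, namely expm1(2x) = expm1(x)(2 + expm1(x)). The sketch below applies it in double precision; the function name `expm1_doubling` and the defaults `k=8`, `terms=12` are assumptions chosen for the example, not values from the text.

```python
import math

def expm1_doubling(x, k=8, terms=12):
    """Approximate expm1(x) = exp(x) - 1 by argument reduction and doubling."""
    r = x / 2 ** k                    # reduced argument x / 2^k
    # Taylor series r + r^2/2! + r^3/3! + ...; the leading 1 is never formed
    term = r
    e = r
    for j in range(2, terms + 1):
        term *= r / j
        e += term
    # expm1(2r) = expm1(r) * (2 + expm1(r)), applied k times
    for _ in range(k):
        e = e * (2.0 + e)
    return e

for x in (0.5, 1e-10):
    print(x, expm1_doubling(x), math.expm1(x), math.exp(x) - 1.0)
```

For x = 1e-10 the last column (computing exp(x) and then subtracting 1) shows the loss of significance that the doubling formula avoids.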


4.3.4 Doubling versus tripling


Suppose we want to compute the function $\sinh(x) = (e^x - e^{-x})/2$. The
obvious doubling formula for sinh,

$$\sinh(2x) = 2 \sinh(x) \cosh(x),$$

involves the auxiliary function $\cosh(x) = (e^x + e^{-x})/2$. Since
$\cosh^2(x) - \sinh^2(x) = 1$, we could use the doubling formula

$$\sinh(2x) = 2 \sinh(x) \sqrt{1 + \sinh^2(x)},$$

but this involves the overhead of computing a square root. This suggests using
the tripling formula

$$\sinh(3x) = \sinh(x)\,(3 + 4\sinh^2(x)). \tag{4.19}$$


However, it is usually more efficient to do argument reduction via the doubling
formula (4.18) for exp, because it takes one multiplication and one squaring
to apply the tripling formula, but only two squarings to apply the doubling
formula twice (and $3 < 2^2$). A drawback is loss of precision, caused by
cancellation in the computation of exp(x) − exp(−x), when |x| is small. In this
case, it is better to use (see Exercise 4.10)

$$\sinh(x) = (\mathrm{expm1}(x) - \mathrm{expm1}(-x))/2. \tag{4.20}$$

See §4.12 for further comments on doubling versus tripling, especially in the
FFT range.
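The benefit of (4.20) for small |x| is easy to observe in double precision. The sketch below uses `math.expm1` from the Python standard library; the two helper names are illustrative only.

```python
import math

def sinh_naive(x):
    """sinh(x) from its definition; suffers cancellation when |x| is small."""
    return (math.exp(x) - math.exp(-x)) / 2

def sinh_expm1(x):
    """sinh(x) via (4.20); expm1(x) and expm1(-x) have opposite signs, so nothing cancels."""
    return (math.expm1(x) - math.expm1(-x)) / 2

x = 1e-9
reference = math.sinh(x)
print("naive :", abs(sinh_naive(x) - reference) / reference)
print("expm1 :", abs(sinh_expm1(x) - reference) / reference)
```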


4.4 Power series


Once argument reduction has been applied, where possible (§4.3), we are usually
faced with the evaluation of a power series. The elementary and special
functions have power series expansions such as

$$\exp x = \sum_{j \ge 0} \frac{x^j}{j!}, \qquad \ln(1+x) = \sum_{j \ge 0} \frac{(-1)^j x^{j+1}}{j+1},$$

$$\arctan x = \sum_{j \ge 0} \frac{(-1)^j x^{2j+1}}{2j+1}, \qquad \sinh x = \sum_{j \ge 0} \frac{x^{2j+1}}{(2j+1)!}, \quad \text{etc.}$$

This section discusses several techniques to recommend or to avoid. We use
the following notations: x is the evaluation point, n is the desired precision,
and d is the number of terms retained in the power series, or d − 1 is the degree
of the corresponding polynomial $\sum_{0 \le j < d} a_j x^j$.

If f(x) is analytic in a neighborhood of some point c, an obvious method to
consider for the evaluation of f(x) is summation of the Taylor series

$$f(x) = \sum_{j=0}^{d-1} (x - c)^j \frac{f^{(j)}(c)}{j!} + R_d(x, c).$$

As a simple but instructive example, we consider the evaluation of exp(x)
for |x| ≤ 1, using

$$\exp(x) = \sum_{j=0}^{d-1} \frac{x^j}{j!} + R_d(x), \tag{4.21}$$

where $|R_d(x)| \le |x|^d \exp(|x|)/d! \le e/d!$.

Using Stirling's approximation for d!, we see that $d \ge K(n) \sim n/\lg n$ is
sufficient to ensure that $|R_d(x)| = O(2^{-n})$. Thus, the time required to evaluate
(4.21) with Horner's rule⁴ is O(n M(n)/log n).

In practice, it is convenient to sum the series in the forward direction
(j = 0, 1, ..., d − 1). The terms $t_j = x^j/j!$ and partial sums

$$S_j = t_0 + t_1 + \cdots + t_j$$

may be generated by the recurrence $t_j = x t_{j-1}/j$, $S_j = S_{j-1} + t_j$, and the
summation terminated when $|t_d| < 2^{-n}/e$. Thus, it is not necessary to estimate
d in advance, as it would be if the series were summed by Horner's rule in the
backward direction (j = d − 1, d − 2, ..., 0) (see however Exercise 4.4).
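The forward recurrence and its stopping rule can be sketched with exact rational arithmetic, which isolates the truncation error from rounding effects. This is an illustration only: the name `exp_forward` is hypothetical, and the threshold $2^{-n}/3$ is used in place of $2^{-n}/e$ (a slightly stricter test) so that everything stays rational.

```python
from fractions import Fraction
import math

def exp_forward(x, n):
    """Sum exp(x) = sum_j x^j/j! forward for |x| <= 1 until |t_d| < 2^-n / 3.

    Returns (S_{d-1}, d), where d is the number of terms retained.
    """
    x = Fraction(x)
    assert abs(x) <= 1
    bound = Fraction(1, 3 * 2 ** n)   # 2^-n / 3 < 2^-n / e
    t = Fraction(1)                   # t_0 = 1
    s = Fraction(0)                   # sum of the terms retained so far
    j = 0
    while abs(t) >= bound:
        s += t                        # S_j = S_{j-1} + t_j
        j += 1
        t = x * t / j                 # t_j = x * t_{j-1} / j
    return s, j

s, d = exp_forward(Fraction(1, 2), 40)
print(d, float(s), math.exp(0.5))
```

Since $|R_d(x)| \le |t_d| \exp(|x|) \le e\,|t_d|$ for |x| ≤ 1, the returned sum differs from exp(x) by less than $2^{-n}$ without d being known in advance.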

We now consider the effect of rounding errors, under the assumption that
floating-point operations are correctly rounded, i.e. satisfy

$$\circ(x \;\mathrm{op}\; y) = (x \;\mathrm{op}\; y)(1 + \delta),$$

where $|\delta| \le \varepsilon$ and "op" = "+", "−", "×" or "/". Here $\varepsilon = 2^{-n}$ is the "machine
precision" or "working precision". Let $\hat{t}_j$ be the computed value of $t_j$, etc.
Thus

$$|\hat{t}_j - t_j| / |t_j| \le 2j\varepsilon + O(\varepsilon^2)$$

and using $\sum_{j=0}^{d} t_j = S_d \le e$,

$$|\hat{S}_d - S_d| \le d e \varepsilon + \sum_{j=1}^{d} 2j\varepsilon\,|t_j| + O(\varepsilon^2) \le (d + 2) e \varepsilon + O(\varepsilon^2) = O(n\varepsilon).$$

Thus, to get $|\hat{S}_d - S_d| = O(2^{-n})$, it is sufficient that $\varepsilon = O(2^{-n}/n)$. In other
words, we need to work with about lg n guard digits. This is not a significant
overhead if (as we assume) the number of digits may vary dynamically. We
can sum with j increasing (the forward direction) or decreasing (the backward
direction). A slightly better error bound is obtainable for summation in the
backward direction, but this method has the disadvantage that the number of
terms d has to be decided in advance (see however Exercise 4.4).

4 By Horner's rule (with argument x), we mean evaluating the polynomial
$s_0 = \sum_{0 \le j \le d} a_j x^j$ of degree d (not d − 1 in this footnote) by the recurrence $s_d = a_d$,
$s_j = a_j + s_{j+1} x$ for j = d − 1, d − 2, ..., 0. Thus, $s_k = \sum_{k \le j \le d} a_j x^{j-k}$. An
evaluation by Horner's rule takes d additions and d multiplications, and is more efficient than
explicitly evaluating the individual terms $a_j x^j$.

In practice, it is inefficient to keep the working precision ε fixed. We can
profitably reduce it when computing $t_j$ from $t_{j-1}$ if $|t_{j-1}|$ is small, without
significantly increasing the error bound. We can also vary the working precision
when accumulating the sum, especially if it is computed in the backward
direction (so the smallest terms are summed first).

It is instructive to consider the effect of relaxing our restriction that |x| ≤ 1.
First suppose that x is large and positive. Since $|t_j| > |t_{j-1}|$ when j < |x|, it
is clear that the number of terms required in the sum (4.21) is at least of order
|x|. Thus, the method is slow for large |x| (see §4.3 for faster methods in this
case).

If |x| is large and x is negative, the situation is even worse. From Stirling's
approximation we have

$$\max_{j \ge 0} |t_j| \simeq \frac{\exp|x|}{\sqrt{2\pi|x|}},$$

but the result is exp(−|x|), so about 2|x|/log 2 guard digits are required to
compensate for what Lehmer called "catastrophic cancellation" [94]. Since
exp(x) = 1/exp(−x), this problem may easily be avoided, but the corresponding
problem is not always so easily avoided for other analytic functions.

Here is a less trivial example. To compute the error function

$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-u^2}\,du,$$

we may use either the power series

$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \sum_{j=0}^{\infty} \frac{(-1)^j x^{2j+1}}{j!\,(2j+1)} \tag{4.22}$$

or the (mathematically, but not numerically) equivalent

$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \exp(-x^2) \sum_{j=0}^{\infty} \frac{2^j x^{2j+1}}{1 \cdot 3 \cdot 5 \cdots (2j+1)}. \tag{4.23}$$

For small |x|, the series (4.22) is slightly faster than the series (4.23) because
there is no need to compute an exponential. However, the series (4.23) is
preferable to (4.22) for moderate |x| because it involves no cancellation. For large
|x|, neither series is satisfactory, because Ω(x²) terms are required, and in this
case it is preferable to use the asymptotic expansion for erfc(x) = 1 − erf(x):
see §4.5. In the borderline region, use of the continued fraction (4.40) could be
considered: see Exercise 4.31.
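The numerical difference between (4.22) and (4.23) shows up already in double precision. The following sketch sums both series with a fixed number of terms (200, an arbitrary choice that is ample for the arguments tried here) and compares them with `math.erf`; the function names are illustrative.

```python
import math

def erf_alternating(x, terms=200):
    """Series (4.22): (2/sqrt(pi)) * sum_j (-1)^j x^(2j+1) / (j! (2j+1))."""
    p = x                             # p = (-1)^j x^(2j+1) / j!
    s = x
    for j in range(1, terms):
        p *= -x * x / j
        s += p / (2 * j + 1)
    return 2 / math.sqrt(math.pi) * s

def erf_positive(x, terms=200):
    """Series (4.23): (2/sqrt(pi)) exp(-x^2) * sum_j 2^j x^(2j+1) / (1*3*...*(2j+1))."""
    t = x                             # t = 2^j x^(2j+1) / (1*3*...*(2j+1))
    s = x
    for j in range(1, terms):
        t *= 2 * x * x / (2 * j + 1)
        s += t
    return 2 / math.sqrt(math.pi) * math.exp(-x * x) * s

for x in (0.5, 6.0):
    print(x, abs(erf_alternating(x) - math.erf(x)), abs(erf_positive(x) - math.erf(x)))
```

At x = 0.5 both series are accurate; at x = 6 the alternating series has intermediate terms of order 10^13, so its result is dominated by rounding error, while the all-positive series remains accurate.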

In the following subsections, we consider different methods to evaluate power
series. We generally ignore the effect of rounding errors, but the results
obtained above are typical.


Assumption about the coefficients

We assume in this section that we have a power series $\sum_{j \ge 0} a_j x^j$, where
$a_{j+\delta}/a_j$ is a rational function R(j) of j, and hence it is easy to evaluate
$a_0, a_1, a_2, \ldots$ sequentially. Here δ is a fixed positive constant, usually 1 or
2. For example, in the case of exp x, we have δ = 1 and

$$\frac{a_{j+1}}{a_j} = \frac{j!}{(j+1)!} = \frac{1}{j+1}.$$

Our assumptions cover the common case of hypergeometric functions. For the
more general case of holonomic functions, see §4.9.2.

In common cases where our assumption is invalid, other good methods are
available to evaluate the function. For example, tan x does not satisfy our
assumption (the coefficients in its Taylor series are called tangent numbers and
are related to Bernoulli numbers; see §4.7.2), but to evaluate tan x we can
use Newton's method on the inverse function (arctan, which does satisfy our
assumptions; see §4.2.5), or we can use tan x = sin x / cos x.


The radius of convergence

If the elementary function is an entire function (e.g. exp, sin), then the power
series converges in the whole complex plane. In this case, the degree of the
denominator of $R(j) = a_{j+1}/a_j$ is greater than that of the numerator.

In other cases (such as ln, arctan), the function is not entire. The power
series only converges in a disk because the function has a singularity on the
boundary of this disk. In fact, ln(x) has a singularity at the origin, which is
why we consider the power series for ln(1 + x). This power series has radius
of convergence 1.

Similarly, the power series for arctan(x) has radius of convergence 1 because
arctan(x) has singularities on the unit circle (at ±i) even though it is
uniformly bounded for all real x.


4.4.1 Direct power series evaluation 

Suppose that we want to evaluate a power series $\sum_{j \ge 0} a_j x^j$ at a given
argument x. Using periodicity (in the cases of sin, cos) and/or argument reduction
techniques (§4.3), we can often ensure that |x| is sufficiently small. Thus, let
us assume that |x| ≤ 1/2 and that the radius of convergence of the series is at
least 1.

As above, assume that $a_{j+\delta}/a_j$ is a rational function of j, and hence easy
to evaluate. For simplicity, we consider only the case δ = 1. To sum the series
with error $O(2^{-n})$ it is sufficient to take n + O(1) terms, so the time required
is O(nM(n)). If the function is entire, then the series converges faster and the
time is reduced to O(nM(n)/(log n)). However, we can do much better by
carrying the argument reduction further, as demonstrated in the next section.


4.4.2 Power series with argument reduction 


Consider the evaluation of exp(x). By applying argument reduction k + O(1)
times, we can ensure that the argument x satisfies $|x| < 2^{-k}$. Then, to obtain
n-bit accuracy we only need to sum O(n/k) terms of the power series. Assuming
that a step of argument reduction is O(M(n)), which is true for the elementary
functions, the total cost is O((k + n/k)M(n)). Indeed, the argument reduction
and/or reconstruction requires O(k) steps of O(M(n)), and the evaluation of
the power series of order n/k costs (n/k)M(n); so choosing $k \sim n^{1/2}$ gives
cost

$$O\left(n^{1/2} M(n)\right).$$

For example, our comments apply to the evaluation of exp(x) using

$$\exp(x) = \exp(x/2)^2,$$

to log1p(x) = ln(1 + x) using

$$\mathrm{log1p}(x) = 2\,\mathrm{log1p}\left(\frac{x}{1 + \sqrt{1 + x}}\right),$$

and to arctan(x) using

$$\arctan x = 2 \arctan\left(\frac{x}{1 + \sqrt{1 + x^2}}\right).$$

Note that in the last two cases each step of the argument reduction requires
a square root, but this can be done with cost O(M(n)) by Newton's method
(§3.5). Thus, in all three cases the overall cost is $O(n^{1/2}M(n))$, although the
implicit constant might be smaller for exp than for log1p or arctan. See
Exercises 4.8-4.9.
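The scheme for exp can be sketched with Python integers used as fixed-point numbers. This is an illustration under stated assumptions: the name `exp_reduced`, the restriction to rational 0 ≤ x ≤ 1, and the guard-bit count `2*k + n.bit_length() + 4` are choices made for the example (a generous allowance, not a tuned bound).

```python
from fractions import Fraction
from math import isqrt, factorial

def exp_reduced(x, n):
    """Approximate exp(x) for rational 0 <= x <= 1 to about n bits.

    Halve the argument k ~ sqrt(n) times, sum the Taylor series in fixed
    point, then square k times. Returns (approximation, terms summed).
    """
    k = isqrt(n)
    w = n + 2 * k + n.bit_length() + 4       # working precision in bits (assumed allowance)
    one = 1 << w
    x_fp = int(Fraction(x) * one)            # fixed-point x, truncated
    r = x_fp >> k                            # fixed-point x / 2^k
    term, total, j = one, one, 0
    while term:
        j += 1
        term = (term * r >> w) // j          # t_j = t_{j-1} * r / j
        total += term
    for _ in range(k):                       # exp(x) = exp(x/2)^2, applied k times
        total = total * total >> w
    return Fraction(total, one), j

approx, used = exp_reduced(1, 64)
reference = sum(Fraction(1, factorial(i)) for i in range(40))
print(used, float(abs(approx - reference)))
```

With n = 64 the series needs only about ten terms after eight halvings, which is the O(n/k) behaviour described above; the extra working bits absorb the roughly k bits lost in the squarings.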


Using symmetries 


A not-so-well-known idea is to evaluate ln(1 + x) using the power series

$$\ln\left(\frac{1+y}{1-y}\right) = 2 \sum_{j \ge 0} \frac{y^{2j+1}}{2j+1}$$

with y defined by (1 + y)/(1 − y) = 1 + x, i.e. y = x/(2 + x). This
saves half the terms and also reduces the argument, since y < x/2 if x > 0.

Unfortunately, this nice idea can be applied only once. For a related example,
see Exercise 4.11.
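A double-precision sketch of this symmetric series follows. The name `log1p_symmetric` and the relative stopping tolerance are assumptions for the example; the result is compared with `math.log1p`.

```python
import math

def log1p_symmetric(x, tol=1e-17):
    """ln(1 + x) for x > -1 via ln((1+y)/(1-y)) = 2 * sum_j y^(2j+1)/(2j+1), y = x/(2+x)."""
    y = x / (2.0 + x)
    y2 = y * y
    power = y                         # y^(2j+1)
    total = y
    j = 0
    while abs(power) > tol * abs(total):
        j += 1
        power *= y2                   # only odd powers of y appear
        total += power / (2 * j + 1)
    return 2.0 * total

for x in (0.5, 1.0):
    print(x, log1p_symmetric(x), math.log1p(x))
```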


4.4.3 Rectangular series splitting 


Once we determine how many terms in the power series are required for the 
desired accuracy, the problem reduces to evaluating a truncated power series, 
i.e. a polynomial. 

Let $P(x) = \sum_{0 \le j < d} a_j x^j$ be the polynomial that we want to evaluate,
deg(P) < d. In the general case, x is a floating-point number of n bits,
and we aim at an accuracy of n bits for P(x). However, the coefficients $a_j$,
or their ratios $R(j) = a_{j+1}/a_j$, are usually small integers or rational numbers
of O(log n) bits. A scalar multiplication involves one coefficient $a_j$ and
the variable x (or more generally an n-bit floating-point number), whereas a
non-scalar multiplication involves two powers of x (or more generally two
n-bit floating-point numbers). Scalar multiplications are cheaper because the $a_j$
are small rationals of size O(log n), whereas x and its powers generally have
O(n) bits. It is possible to evaluate P(x) with $O(\sqrt{n})$ non-scalar multiplications
(plus O(n) scalar multiplications and O(n) additions, using $O(\sqrt{n})$ storage).
The same idea applies, more generally, to evaluation of hypergeometric
functions.

Classical splitting

Suppose d = jk, define $y = x^k$, and write

$$P(x) = \sum_{\ell=0}^{j-1} y^\ell P_\ell(x), \quad \text{where} \quad P_\ell(x) = \sum_{m=0}^{k-1} a_{k\ell+m}\, x^m.$$

We first compute the powers $x^2, x^3, \ldots, x^{k-1}, x^k = y$, then the polynomials
$P_\ell(x)$ are evaluated simply by multiplying $a_{k\ell+m}$ and the precomputed $x^m$ (it
is important not to use Horner's rule here, since this would involve expensive
non-scalar multiplications). Finally, P(x) is computed from the $P_\ell(x)$ using
Horner's rule with argument y. To see the idea geometrically, write P(x) as

$$\begin{array}{l}
y^0\,[a_0 + a_1 x + a_2 x^2 + \cdots + a_{k-1} x^{k-1}] \;+ \\
y^1\,[a_k + a_{k+1} x + a_{k+2} x^2 + \cdots + a_{2k-1} x^{k-1}] \;+ \\
y^2\,[a_{2k} + a_{2k+1} x + a_{2k+2} x^2 + \cdots + a_{3k-1} x^{k-1}] \;+ \\
\quad\vdots \\
y^{j-1}\,[a_{jk-k} + a_{jk-k+1} x + a_{jk-k+2} x^2 + \cdots + a_{jk-1} x^{k-1}],
\end{array}$$

where $y = x^k$. The terms in square brackets are the polynomials $P_0(x), P_1(x),
\ldots, P_{j-1}(x)$.

As an example, consider d = 12, with j = 3 and k = 4. This gives
$P_0(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3$, $P_1(x) = a_4 + a_5 x + a_6 x^2 + a_7 x^3$,
$P_2(x) = a_8 + a_9 x + a_{10} x^2 + a_{11} x^3$, then $P(x) = P_0(x) + y P_1(x) + y^2 P_2(x)$,
where $y = x^4$. Here we need to compute $x^2, x^3, x^4$, which requires three
non-scalar products (note that even powers like $x^4$ should be computed as
$(x^2)^2$ to use squarings instead of multiplies), and we need two non-scalar
products to evaluate P(x); thus, a total of five non-scalar products, instead of
d − 2 = 10 with a naive application of Horner's rule to P(x).⁵
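Classical splitting can be sketched with exact rationals, counting the non-scalar multiplications as it goes. The function name `classical_splitting` and the choice of coefficients $a_i = 1/i!$ in the demonstration are illustrative assumptions; in exact rational arithmetic there is no real cost difference between scalar and non-scalar products, so the counter only mirrors the bookkeeping of the text.

```python
from fractions import Fraction
from math import factorial

def classical_splitting(coeffs, x, j, k):
    """Evaluate P(x) = sum_i coeffs[i] * x^i, where len(coeffs) == j*k.

    Returns (value, number of non-scalar multiplications).
    """
    assert len(coeffs) == j * k
    nonscalar = 0
    powers = [Fraction(1), x]                     # x^0, x^1
    for m in range(2, k + 1):                     # x^2, ..., x^k
        if m % 2 == 0:
            powers.append(powers[m // 2] * powers[m // 2])   # even power: a squaring
        else:
            powers.append(powers[m - 1] * x)
        nonscalar += 1
    y = powers[k]
    # P_l(x) uses only scalar multiplications coeff * x^m
    blocks = [sum(coeffs[k * l + m] * powers[m] for m in range(k))
              for l in range(j)]
    value = blocks[j - 1]                         # Horner's rule with argument y
    for l in range(j - 2, -1, -1):
        value = value * y + blocks[l]
        nonscalar += 1
    return value, nonscalar

coeffs = [Fraction(1, factorial(i)) for i in range(12)]
x = Fraction(1, 3)
print(classical_splitting(coeffs, x, 3, 4))
```

For d = 12 with j = 3 and k = 4 the counter reports five non-scalar products, matching the count in the example above.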


Modular splitting 


An alternate splitting is the following, which may be obtained by transposing
the matrix of coefficients above, swapping j and k, and interchanging the
powers of x and y. It might also be viewed as a generalized odd–even scheme
(§1.3.5). Suppose as before that d = jk, and write, with $y = x^j$:

$$P(x) = \sum_{\ell=0}^{j-1} x^\ell P_\ell(y), \quad \text{where} \quad P_\ell(y) = \sum_{m=0}^{k-1} a_{jm+\ell}\, y^m.$$

5 P(x) has degree d − 1, so Horner's rule performs d − 1 products, but the first one, $x \times a_{d-1}$,
is a scalar product, hence there are d − 2 non-scalar products.

First compute $y = x^j, y^2, y^3, \ldots, y^{k-1}$. Now the polynomials $P_\ell(y)$ can be
evaluated using only scalar multiplications of the form $a_{jm+\ell} \times y^m$.
To see the idea geometrically, write P(x) as

$$\begin{array}{l}
x^0\,[a_0 + a_j y + a_{2j} y^2 + \cdots] \;+ \\
x^1\,[a_1 + a_{j+1} y + a_{2j+1} y^2 + \cdots] \;+ \\
x^2\,[a_2 + a_{j+2} y + a_{2j+2} y^2 + \cdots] \;+ \\
\quad\vdots \\
x^{j-1}\,[a_{j-1} + a_{2j-1} y + a_{3j-1} y^2 + \cdots],
\end{array}$$

where $y = x^j$. We traverse the first row of the array, then the second row, then
the third, ..., finally the jth row, accumulating sums $S_0, S_1, \ldots, S_{j-1}$ (one
for each row). At the end of this process, $S_\ell = P_\ell(y)$, and we only have to
evaluate

$$P(x) = \sum_{\ell=0}^{j-1} x^\ell S_\ell.$$

The complexity of each scheme is almost the same (see Exercise 4.12). With
d = 12 (j = 3 and k = 4) we have $P_0(y) = a_0 + a_3 y + a_6 y^2 + a_9 y^3$,
$P_1(y) = a_1 + a_4 y + a_7 y^2 + a_{10} y^3$, $P_2(y) = a_2 + a_5 y + a_8 y^2 + a_{11} y^3$.
We first compute $y = x^3$, $y^2$, and $y^3$, then we evaluate $P_0(y)$ in three scalar
multiplications $a_3 y$, $a_6 y^2$, and $a_9 y^3$ and three additions, similarly for $P_1$ and
$P_2$. Finally we evaluate P(x) using

$$P(x) = P_0(y) + x P_1(y) + x^2 P_2(y),$$

(here we might use Horner's rule). In this example, we have a total of six
non-scalar multiplications: four to compute y and its powers, and two to evaluate
P(x).
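The modular variant admits a sketch along the same lines. As before, the function name `modular_splitting` and the coefficients $a_i = 1/i!$ are illustrative assumptions, and the counter only tracks which products would be non-scalar in a multiple-precision setting.

```python
from fractions import Fraction
from math import factorial

def modular_splitting(coeffs, x, j, k):
    """Evaluate P(x) = sum_i coeffs[i] * x^i, where len(coeffs) == j*k.

    Returns (value, number of non-scalar multiplications).
    """
    assert len(coeffs) == j * k
    nonscalar = 0
    y = x
    for _ in range(j - 1):                        # y = x^j
        y = y * x
        nonscalar += 1
    ypow = [Fraction(1), y]                       # y^0, y^1
    for m in range(2, k):                         # y^2, ..., y^(k-1)
        ypow.append(ypow[m - 1] * y)
        nonscalar += 1
    # S_l = P_l(y) = sum_m a_{jm+l} y^m, scalar multiplications only
    sums = [sum(coeffs[j * m + l] * ypow[m] for m in range(k))
            for l in range(j)]
    value = sums[j - 1]                           # Horner's rule with argument x
    for l in range(j - 2, -1, -1):
        value = value * x + sums[l]
        nonscalar += 1
    return value, nonscalar

coeffs = [Fraction(1, factorial(i)) for i in range(12)]
x = Fraction(1, 3)
print(modular_splitting(coeffs, x, 3, 4))
```

With j = 3 and k = 4 the count is six: four products for $x^2$, $y = x^3$, $y^2$, $y^3$ and two for the final Horner step, as in the text.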


Complexity of rectangular series splitting 

To evaluate a polynomial P(x) of degree d − 1 = jk − 1, rectangular series
splitting takes O(j + k) non-scalar multiplications, each costing O(M(n)),
and O(jk) scalar multiplications. The scalar multiplications involve
multiplication and/or division of a multiple-precision number by small integers.
Assume that these multiplications and/or divisions take time c(d)n each (see
Exercise 4.13 for a justification of this assumption). The function c(d) accounts
for the fact that the involved scalars (the coefficients $a_j$ or the ratios $a_{j+1}/a_j$)
have a size depending on the degree d of P(x). In practice, we can usually
regard c(d) as constant.

Choosing $j \sim k \sim d^{1/2}$, we get overall time

$$O\left(d^{1/2} M(n) + d n \cdot c(d)\right). \tag{4.24}$$

If d is of the same order as the precision n of x, this is not an improvement
on the bound $O(n^{1/2}M(n))$ that we obtained already by argument reduction
and power series evaluation (§4.4.2). However, we can do argument reduction
before applying rectangular series splitting. Assuming that c(n) = O(1) (see
Exercise 4.14 for a detailed analysis), the total complexity is

$$T(n) = O\left(\frac{n}{d} M(n) + d^{1/2} M(n) + d n\right),$$

where the extra (n/d)M(n) term comes from argument reduction and/or
reconstruction. Which term dominates? There are two cases:

1. $M(n) \gg n^{4/3}$. Here the minimum is obtained when the first two terms
(argument reduction/reconstruction and non-scalar multiplications) are
equal, i.e. for $d \sim n^{2/3}$, which yields $T(n) = O(n^{1/3}M(n))$. This case
applies if we use classical or Karatsuba multiplication, since lg 3 > 4/3,
and similarly for Toom–Cook 3-, 4-, 5-, or 6-way multiplication (but not
7-way, since $\log_7 13 < 4/3$). In this case, $T(n) \gg n^{5/3}$.

2. $M(n) \ll n^{4/3}$. Here the minimum is obtained when the first and the last
terms (argument reduction/reconstruction and scalar multiplications) are
equal. The optimal value of d is then $\sqrt{M(n)}$, and we get an improved
bound $\Theta(n\sqrt{M(n)}) \gg n^{3/2}$. We cannot approach the $O(n^{1+\varepsilon})$ that is
achievable with AGM-based methods (if applicable); see §4.8.


4.5 Asymptotic expansions 


Often it is necessary to use different methods to evaluate a special function in
different parts of its domain. For example, the exponential integral⁶

$$E_1(x) = \int_x^\infty \frac{\exp(-u)}{u}\,du \tag{4.25}$$

is defined for all x > 0. However, the power series

$$E_1(x) + \gamma + \ln x = \sum_{j=1}^{\infty} \frac{x^j (-1)^{j-1}}{j!\,j} \tag{4.26}$$

is unsatisfactory as a means of evaluating $E_1(x)$ for large positive x, for the
reasons discussed in §4.4 in connection with the power series (4.22) for erf(x),
or the power series for exp(x) (x negative). For sufficiently large positive x, it
is preferable to use

$$e^x E_1(x) = \sum_{j=1}^{k} \frac{(j-1)!\,(-1)^{j-1}}{x^j} + R_k(x), \tag{4.27}$$

where

$$R_k(x) = k!\,(-1)^k \exp(x) \int_x^\infty \frac{\exp(-u)}{u^{k+1}}\,du. \tag{4.28}$$

Note that

$$|R_k(x)| < \frac{k!}{x^{k+1}},$$

so

$$\lim_{x \to +\infty} R_k(x) = 0,$$

but $\lim_{k \to \infty} R_k(x)$ does not exist. In other words, the series

$$\sum_{j=1}^{\infty} \frac{(j-1)!\,(-1)^{j-1}}{x^j}$$

is divergent. In such cases, we call this an asymptotic series and write

$$e^x E_1(x) \sim \sum_{j > 0} \frac{(j-1)!\,(-1)^{j-1}}{x^j}. \tag{4.29}$$

6 $E_1(x)$ and $\mathrm{Ei}(x) = \mathrm{PV} \int_{-\infty}^{x} (\exp(t)/t)\,dt$ are both called "exponential integrals". Closely
related is the "logarithmic integral" $\mathrm{li}(x) = \mathrm{Ei}(\ln x) = \mathrm{PV} \int_0^x (1/\ln t)\,dt$. Here the integrals
$\mathrm{PV} \int \cdots$ should be interpreted as Cauchy principal values if there is a singularity in the range
of integration. The power series (4.26) is valid for x ∈ C if |arg x| < π (see Exercise 4.16).

Although they do not generally converge, asymptotic series are very useful.
Often (though not always!) the error is bounded by the last term taken in the
series (or by the first term omitted). Also, when the terms in the asymptotic
series alternate in sign, it can often be shown that the true value lies between
two consecutive approximations obtained by summing the series with (say) k
and k + 1 terms. For example, this is true for the series (4.29) above, provided
x is real and positive.

When x is large and positive, the relative error attainable by using (4.27)
with $k = \lfloor x \rfloor$ is $O(x^{1/2} \exp(-x))$, because

$$|R_k(k)| \le k!/k^{k+1} = O(k^{-1/2} \exp(-k)) \tag{4.30}$$

and the leading term on the right side of (4.27) is 1/x. Thus, the asymptotic
series may be used to evaluate $E_1(x)$ to precision n whenever
x > n ln 2 + O(ln n). More precise estimates can be obtained by using a
version of Stirling's approximation with error bounds, for example

$$\left(\frac{k}{e}\right)^k \sqrt{2\pi k} < k! < \left(\frac{k}{e}\right)^k \sqrt{2\pi k}\, \exp\left(\frac{1}{12k}\right).$$


If x is too small for the asymptotic approximation to be sufficiently accurate,
we can avoid the problem of cancellation in the power series (4.26) by the
technique of Exercise 4.19. However, the asymptotic approximation is faster
and hence is preferable whenever it is sufficiently accurate.

Examples where asymptotic expansions are useful include the evaluation of
erfc(x), Γ(x), Bessel functions, etc. We discuss some of these below.

Asymptotic expansions often arise when the convergence of series is
accelerated by the Euler–Maclaurin sum formula⁷. For example, Euler's constant γ
is defined by

$$\gamma = \lim_{N \to \infty} (H_N - \ln N), \tag{4.31}$$

where $H_N = \sum_{1 \le j \le N} 1/j$ is a harmonic number. However, Eqn. (4.31)
converges slowly, so to evaluate γ accurately we need to accelerate the
convergence. This can be done using the Euler–Maclaurin formula. The idea is to
split the sum $H_N$ into two parts

$$H_N = H_{p-1} + \sum_{j=p}^{N} \frac{1}{j}.$$


We approximate the second sum using the Euler–Maclaurin formula⁷ with
a = p, b = N, f(x) = 1/x, then let N → +∞. The boundary term f(a)/2
contributes 1/(2p), and since $H_{p-1} + 1/(2p) = H_p - 1/(2p)$, the result is

$$\gamma \sim H_p - \ln p - \frac{1}{2p} + \sum_{k=1}^{\infty} \frac{B_{2k}}{2k}\, p^{-2k}. \tag{4.32}$$

If p and the number of terms in the asymptotic expansion are chosen
judiciously, this gives a good algorithm for computing γ (though not the best
algorithm: see §4.12 for a faster algorithm that uses properties of Bessel functions).
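As a numerical check of this approach, the sketch below evaluates $H_p - \ln p - 1/(2p) + \sum_{k} B_{2k}/(2k\,p^{2k})$ in double precision with p = 10 and five Bernoulli terms. The function name `euler_gamma`, the parameter choices, and the hard-coded table of standard Bernoulli numbers $B_2, \ldots, B_{10}$ are assumptions made for the example.

```python
import math
from fractions import Fraction

# Standard values of the Bernoulli numbers B_2, B_4, ..., B_10
BERNOULLI = [Fraction(1, 6), Fraction(-1, 30), Fraction(1, 42),
             Fraction(-1, 30), Fraction(5, 66)]

def euler_gamma(p=10, terms=5):
    """Approximate Euler's constant from H_p - ln p - 1/(2p) + sum_k B_2k / (2k p^(2k))."""
    h_p = sum(1.0 / j for j in range(1, p + 1))              # harmonic number H_p
    tail = sum(float(BERNOULLI[k - 1]) / (2 * k * p ** (2 * k))
               for k in range(1, terms + 1))
    return h_p - math.log(p) - 1.0 / (2 * p) + tail

print(euler_gamma())
```

With these choices the first omitted term, $B_{12}/(12 \cdot 10^{12})$, is about 2 × 10^-14 in magnitude, which indicates the accuracy to expect.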


7 The Euler–Maclaurin sum formula is a way of expressing the difference between a sum and
an integral as an asymptotic expansion. For example, assuming that a ∈ Z, b ∈ Z, a ≤ b, and
f(x) satisfies certain conditions, one form of the formula is

$$\sum_{a \le k \le b} f(k) - \int_a^b f(x)\,dx \sim \frac{f(a) + f(b)}{2} + \sum_{k \ge 1} \frac{B_{2k}}{(2k)!} \left( f^{(2k-1)}(b) - f^{(2k-1)}(a) \right).$$

Often we let b → +∞ and omit the terms involving b on the right-hand side. For more
information, see §4.12.

Here is another example. The Riemann zeta-function ζ(s) is defined for
s ∈ C, ℜ(s) > 1, by

$$\zeta(s) = \sum_{j=1}^{\infty} j^{-s}, \tag{4.33}$$

and by analytic continuation for other s ≠ 1. ζ(s) may be evaluated to any
desired precision if m and p are chosen large enough in the Euler–Maclaurin
formula

$$\zeta(s) = \sum_{j=1}^{p-1} j^{-s} + \frac{p^{-s}}{2} + \frac{p^{1-s}}{s-1} + \sum_{k=1}^{m} T_{k,p}(s) + E_{m,p}(s), \tag{4.34}$$

where

$$T_{k,p}(s) = \frac{B_{2k}}{(2k)!}\, p^{1-s-2k} \prod_{j=0}^{2k-2} (s + j), \tag{4.35}$$

$$|E_{m,p}(s)| < |T_{m+1,p}(s)\,(s + 2m + 1)/(\sigma + 2m + 1)|, \tag{4.36}$$

m ≥ 0, p ≥ 1, σ = ℜ(s) > −(2m + 1), and the $B_{2k}$ are Bernoulli numbers.
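Formula (4.34) can be tried out in double precision for real s > 1 by dropping the error term $E_{m,p}(s)$. This is a sketch only: the name `zeta_em`, the defaults p = 10 and m = 5, and the hard-coded table of standard Bernoulli numbers are assumptions for the example.

```python
import math
from fractions import Fraction

# Standard values of the Bernoulli numbers B_2, ..., B_10
BERNOULLI = {2: Fraction(1, 6), 4: Fraction(-1, 30), 6: Fraction(1, 42),
             8: Fraction(-1, 30), 10: Fraction(5, 66)}

def zeta_em(s, p=10, m=5):
    """zeta(s) for real s > 1 from (4.34) with the error term E_{m,p}(s) omitted."""
    s = float(s)
    total = sum(j ** -s for j in range(1, p))            # sum_{j=1}^{p-1} j^-s
    total += p ** -s / 2 + p ** (1 - s) / (s - 1)
    for k in range(1, m + 1):
        prod = 1.0
        for j in range(2 * k - 1):                       # prod_{j=0}^{2k-2} (s + j)
            prod *= s + j
        # T_{k,p}(s) as in (4.35)
        total += (float(BERNOULLI[2 * k]) / math.factorial(2 * k)
                  * p ** (1 - s - 2 * k) * prod)
    return total

print(zeta_em(2.0), math.pi ** 2 / 6)
```

For s = 2 the bound (4.36) gives an error below $|T_{6,10}(2)| \approx 2.5 \times 10^{-14}$, consistent with the agreement with π²/6.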
In arbitrary-precision computations, we must be able to compute as many
terms of an asymptotic expansion as are required to give the desired accuracy.
It is easy to see that, if m in (4.34) is bounded as the precision n goes to
∞, then p has to increase as an exponential function of n. To evaluate ζ(s)
from (4.34) to precision n in time polynomial in n, both m and p must tend to
infinity with n. Thus, the Bernoulli numbers $B_2, \ldots, B_{2m}$ cannot be stored in
a table of fixed size⁸, but must be computed when needed (see §4.7). For this
reason, we cannot use asymptotic expansions when the general form of the
coefficients is unknown or the coefficients are too difficult to evaluate. Often
there is a related expansion with known and relatively simple coefficients. For
example, the asymptotic expansion (4.38) for ln Γ(x) has coefficients related to
the Bernoulli numbers, like the expansion (4.34) for ζ(s), and thus is simpler to
implement than Stirling's asymptotic expansion for Γ(x) (see Exercise 4.42).

8 In addition, we would have to store them as exact rationals, taking $\sim m^2 \lg m$ bits of storage,
since a floating-point representation would not be convenient unless the target precision n
were known in advance. See §4.7.2 and Exercise 4.37.

Consider the computation of the error function erf(x). As seen in §4.4, the
series (4.22) and (4.23) are not satisfactory for large |x|, since they require
Ω(x²) terms. For example, to evaluate erf(1000) with an accuracy of six digits,
Eqn. (4.22) requires at least 2 718 279 terms! Instead, we may use an
asymptotic expansion. The complementary error function erfc(x) = 1 − erf(x)
satisfies

$$\mathrm{erfc}(x) \sim \frac{\exp(-x^2)}{x\sqrt{\pi}} \sum_{j=0}^{k} (-1)^j \frac{(2j)!}{j!}\,(2x)^{-2j}, \tag{4.37}$$

with the error bounded in absolute value by the next term and of the same sign.
In the case x = 1000, the term for j = 1 of the sum equals $-0.5 \times 10^{-6}$; thus,
$\exp(-x^2)/(x\sqrt{\pi})$ is an approximation to erfc(x) with an accuracy of six digits.
Because $\mathrm{erfc}(1000) \approx 1.86 \times 10^{-434\,298}$ is very small, this gives an extremely
accurate approximation to erf(1000).
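The expansion (4.37) is straightforward to try in double precision. In the sketch below the function name `erfc_asymptotic` and the default cutoff $k = \lfloor x^2 \rfloor$ (stopping just before the terms would begin to grow) are assumptions for the example; the reference value comes from `math.erfc`.

```python
import math

def erfc_asymptotic(x, k=None):
    """Sum (4.37) up to j = k; by default k = floor(x^2)."""
    if k is None:
        k = int(x * x)
    term = 1.0                         # (-1)^j (2j)!/j! (2x)^(-2j) for j = 0
    total = 1.0
    for j in range(1, k + 1):
        # ratio of consecutive terms: -(2j)(2j-1)/j / (2x)^2 = -(2j-1)/(2 x^2)
        term *= -(2 * j - 1) / (2 * x * x)
        total += term
    return math.exp(-x * x) / (x * math.sqrt(math.pi)) * total

for x in (5.0, 10.0):
    approx = erfc_asymptotic(x)
    print(x, approx, abs(approx - math.erfc(x)) / math.erfc(x))
```

At x = 5 the attainable relative accuracy is limited to roughly $e^{-25}$ by the smallest term, whereas at x = 10 the series reaches full double precision.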

For a function like the error function, where both a power series (at x = 0)
and an asymptotic expansion (at x = ∞) are available, we might prefer to use
the former or the latter, depending on the value of the argument and on the
desired precision. We study here in some detail the case of the error function,
since it is typical.

The sum in (4.37) is divergent, since its jth term is $\sim \sqrt{2}\,(j/(e x^2))^j$. We
need to show that the smallest term is $O(2^{-n})$ in order to be able to deduce
an n-bit approximation to erfc(x). The terms decrease while $j < x^2 + 1/2$,
so the minimum is obtained for $j \approx x^2$, and is of order $e^{-x^2}$; thus, we need
$x > \sqrt{n \ln 2}$. For example, for $n = 10^6$ bits this yields x > 833. However,
since erfc(x) is small for large x, say $\mathrm{erfc}(x) \approx 2^{-\lambda}$, we need only m = n − λ
correct bits of erfc(x) to get n correct bits of erf(x) = 1 − erfc(x).

Consider x fixed and j varying in the terms in the sums (4.22) and (4.37).
For $j < x^2$, $x^{2j}/j!$ is an increasing function of j, but $(2j)!/(j!\,(4x^2)^j)$ is a
decreasing function of j. In this region, the terms in Eqn. (4.37) are decreasing.
Thus, comparing the series (4.22) and (4.37), we see that the latter should
always be used if it can give sufficient accuracy. Similarly, (4.37) should if
possible be used in preference to (4.23), as the magnitudes of corresponding
terms in (4.22) and in (4.23) are similar.

Algorithm Erf computes erf(x) for real positive x (for other real x, use the
fact that erf(x) is an odd function, so erf(−x) = −erf(x) and erf(0) = 0).
In Algorithm Erf, the number of terms needed if Eqn. (4.22) or Eqn. (4.23)
is used is approximately the unique positive root $j_0$ (rounded up to the next
integer) of

$$j(\ln j - 2\ln x - 1) = n \ln 2,$$

so $j_0 > e x^2$. On the other hand, if Eqn. (4.37) is used, then the summation
bound k is less than $x^2 + 1/2$ (since otherwise the terms start increasing). The
condition $(m + 1/2)\ln(2) < x^2$ in the algorithm ensures that the asymptotic
expansion can give m-bit accuracy.

Algorithm 4.2 Erf
Input: positive floating-point number x, integer n
Output: an n-bit approximation to erf(x)
m ← ⌈n − (x² + ln x + (ln π)/2)/(ln 2)⌉
if (m + 1/2) ln(2) < x² then
    t ← erfc(x) with the asymptotic expansion (4.37) and precision m
    return 1 − t (in precision n)
else if x < 1 then
    compute erf(x) with the power series (4.22) in precision n
else
    compute erf(x) with the power series (4.23) in precision n.

Here is an example: for x = 800 and a precision of one million bits,
Equation (4.23) requires about $j_0 = 2\,339\,601$ terms. Eqn. (4.37) tells us that
$\mathrm{erfc}(x) \approx 2^{-923\,335}$; thus, we need only m = 76 665 bits of precision for
erfc(x); in this case Eqn. (4.37) requires only about k = 10 375 terms. Note
that using Eqn. (4.22) would be slower than using Eqn. (4.23), because we
would have to compute about the same number of terms, but with higher
precision, to compensate for cancellation. We recommend using Eqn. (4.22) only
if |x| is small enough that any cancellation is insignificant (for example, if
|x| < 1).

Another example, closer to the boundary: for x = 589, still with $n = 10^6$,
we have m = 499 489, which gives $j_0 = 1\,497\,924$, and k = 325 092. For
somewhat smaller x (or larger n), it might be desirable to use the continued
fraction (4.40), see Exercise 4.31.
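The branch selection of Algorithm Erf can be checked against the two worked examples with ordinary floating-point arithmetic. The sketch below only reproduces the decision and the value of m, not the multiple-precision evaluation itself; the function name `erf_strategy` and the returned labels are illustrative.

```python
import math

def erf_strategy(x, n):
    """Return (method, working precision in bits) following the tests in Algorithm Erf."""
    # erfc(x) is about 2^-lam, with lam = (x^2 + ln x + (ln pi)/2) / ln 2
    lam = (x * x + math.log(x) + math.log(math.pi) / 2) / math.log(2)
    m = math.ceil(n - lam)
    if (m + 0.5) * math.log(2) < x * x:
        return "asymptotic expansion (4.37)", m
    if x < 1:
        return "power series (4.22)", n
    return "power series (4.23)", n

print(erf_strategy(800, 10 ** 6))
print(erf_strategy(589, 10 ** 6))
print(erf_strategy(3, 100))
```

The first two calls reproduce m = 76 665 and m = 499 489 from the examples above.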

Occasionally, an asymptotic expansion can be used to obtain arbitrarily high
precision. For example, consider the computation of ln Γ(x). For large positive
x, we can use Stirling's asymptotic expansion

$$\ln\Gamma(x) = \left(x - \frac{1}{2}\right)\ln x - x + \frac{\ln(2\pi)}{2} + \sum_{k=1}^{m-1} \frac{B_{2k}}{2k(2k-1)x^{2k-1}} + R_m(x), \tag{4.38}$$

where $R_m(x)$ is less in absolute value than the first term neglected, i.e.

$$\frac{B_{2m}}{2m(2m-1)x^{2m-1}},$$

and has the same sign⁹. The ratio of successive terms $t_k$ and $t_{k+1}$ of the sum is

$$\frac{t_{k+1}}{t_k} \approx -\left(\frac{k}{\pi x}\right)^2,$$

so the terms start to increase in absolute value for (approximately) k > πx.
This gives a bound on the accuracy attainable, in fact

$$\ln|R_m(x)| > -2\pi x \ln(x) + O(x).$$

However, because Γ(x) satisfies the functional equation Γ(x + 1) = xΓ(x),
we can take x′ = x + δ for some sufficiently large δ ∈ N, evaluate ln Γ(x′)
using the asymptotic expansion, and then compute ln Γ(x) from the functional
equation. See Exercise 4.21.
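The shift-then-expand strategy can be sketched in double precision and compared with `math.lgamma`. The name `lgamma_stirling`, the defaults `shift=10` and `m=6`, and the hard-coded table of standard Bernoulli numbers are assumptions made for this example.

```python
import math
from fractions import Fraction

# Standard values of the Bernoulli numbers B_2, ..., B_10
BERNOULLI = {2: Fraction(1, 6), 4: Fraction(-1, 30), 6: Fraction(1, 42),
             8: Fraction(-1, 30), 10: Fraction(5, 66)}

def lgamma_stirling(x, shift=10, m=6):
    """ln Gamma(x) for real x > 0: apply (4.38) at x' = x + shift with terms
    k = 1..m-1, then undo the shift with Gamma(x + 1) = x Gamma(x)."""
    xs = x + shift
    value = (xs - 0.5) * math.log(xs) - xs + math.log(2 * math.pi) / 2
    for k in range(1, m):
        value += float(BERNOULLI[2 * k]) / (2 * k * (2 * k - 1) * xs ** (2 * k - 1))
    # ln Gamma(x) = ln Gamma(x + shift) - sum_{i=0}^{shift-1} ln(x + i)
    for i in range(shift):
        value -= math.log(x + i)
    return value

for x in (0.5, 3.5):
    print(x, lgamma_stirling(x), math.lgamma(x))
```

Without the shift, the expansion at x = 0.5 could not reach this accuracy, since its terms start to grow after only one or two steps (k > πx).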




4.6 Continued fractions 


In §4.5, we considered the exponential integral $E_1(x)$. This can be computed
using the continued fraction

$$e^x E_1(x) = \cfrac{1}{x + \cfrac{1}{1 + \cfrac{1}{x + \cfrac{2}{1 + \cfrac{2}{x + \cfrac{3}{1 + \cdots}}}}}}.$$

Writing continued fractions in this way takes a lot of space, so instead we use
the shorthand notation

$$e^x E_1(x) = \frac{1}{x+}\,\frac{1}{1+}\,\frac{1}{x+}\,\frac{2}{1+}\,\frac{2}{x+}\,\frac{3}{1+} \cdots. \tag{4.39}$$

Another example is

$$\mathrm{erfc}(x) = \left(\frac{\exp(-x^2)}{\sqrt{\pi}}\right) \frac{1}{x+}\,\frac{1/2}{x+}\,\frac{2/2}{x+}\,\frac{3/2}{x+}\,\frac{4/2}{x+}\,\frac{5/2}{x+} \cdots. \tag{4.40}$$

Formally, a continued fraction

$$f = b_0 + \frac{a_1}{b_1+}\,\frac{a_2}{b_2+}\,\frac{a_3}{b_3+} \cdots \in \hat{\mathbb{C}}$$

is defined by two sequences $(a_j)_{j \in \mathbb{N}^*}$ and $(b_j)_{j \in \mathbb{N}}$, where $a_j, b_j \in \mathbb{C}$. Here
$\hat{\mathbb{C}} = \mathbb{C} \cup \{\infty\}$ is the set of extended complex numbers¹⁰. The expression f is
defined to be $\lim_{k \to \infty} f_k$, if the limit exists, where

$$f_k = b_0 + \frac{a_1}{b_1+}\,\frac{a_2}{b_2+}\,\frac{a_3}{b_3+} \cdots \frac{a_k}{b_k} \tag{4.41}$$

is the finite continued fraction, called the kth approximant, obtained by
truncating the infinite continued fraction after k quotients.

9 The asymptotic expansion is also valid for x ∈ C, |arg x| < π, x ≠ 0, but the bound on the
error term $R_m(x)$ in this case is more complicated. See for example [1, 6.1.42].

Sometimes continued fractions are preferable, for computational purposes,
to power series or asymptotic expansions. For example, Euler's continued
fraction (4.39) converges for all real x > 0, and is better for computation of $E_1(x)$
than the power series (4.26) in the region where the power series suffers from
catastrophic cancellation but the asymptotic expansion (4.27) is not sufficiently
accurate. Convergence of (4.39) is slow if x is small, so (4.39) is preferred
for precision n evaluation of $E_1(x)$ only when x is in a certain interval, say
$x \in (c_1 n, c_2 n)$, $c_1 \approx 0.1$, $c_2 = \ln 2 \approx 0.6931$ (see Exercise 4.24).

Continued fractions may be evaluated by either forward or backward
recurrence relations. Consider the finite continued fraction

$$y = \frac{a_1}{b_1+}\,\frac{a_2}{b_2+}\,\frac{a_3}{b_3+} \cdots \frac{a_k}{b_k}. \tag{4.42}$$

The backward recurrence is $R_k = 1$, $R_{k-1} = b_k$,

$$R_j = b_{j+1} R_{j+1} + a_{j+2} R_{j+2} \quad (j = k-2, \ldots, 0), \tag{4.43}$$

and $y = a_1 R_1 / R_0$, with invariant

$$\frac{R_j}{R_{j+1}} = b_{j+1} + \frac{a_{j+2}}{b_{j+2}+} \cdots \frac{a_k}{b_k}.$$

The forward recurrence is $P_0 = 0$, $P_1 = a_1$, $Q_0 = 1$, $Q_1 = b_1$,

$$\left.\begin{array}{l} P_j = b_j P_{j-1} + a_j P_{j-2} \\ Q_j = b_j Q_{j-1} + a_j Q_{j-2} \end{array}\right\} \quad (j = 2, \ldots, k), \tag{4.44}$$

and $y = P_k / Q_k$ (see Exercise 4.26).
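Both recurrences can be sketched with exact rationals, which makes it easy to confirm that they produce the same value for a finite continued fraction. The function names and the helper `e1_cf_coeffs` (which builds the first k partial quotients of (4.39)) are illustrative; index 0 of the coefficient lists is unused so that the indices match the text.

```python
from fractions import Fraction

def cf_backward(a, b):
    """y = a[1]/(b[1] + a[2]/(b[2] + ... + a[k]/b[k])) via (4.43); a[0], b[0] are unused."""
    k = len(a) - 1
    hi, lo = Fraction(1), Fraction(b[k])          # R_k, R_{k-1}
    for j in range(k - 2, -1, -1):
        # R_j = b_{j+1} R_{j+1} + a_{j+2} R_{j+2}
        hi, lo = lo, b[j + 1] * lo + a[j + 2] * hi
    return a[1] * hi / lo                         # a_1 R_1 / R_0

def cf_forward(a, b):
    """The same value via the forward recurrence (4.44)."""
    k = len(a) - 1
    p_prev, p = Fraction(0), Fraction(a[1])       # P_0, P_1
    q_prev, q = Fraction(1), Fraction(b[1])       # Q_0, Q_1
    for j in range(2, k + 1):
        p_prev, p = p, b[j] * p + a[j] * p_prev
        q_prev, q = q, b[j] * q + a[j] * q_prev
    return p / q

def e1_cf_coeffs(x, k):
    """First k partial numerators and denominators of (4.39)."""
    a = [None] + [1 if j == 1 else j // 2 for j in range(1, k + 1)]
    b = [None] + [x if j % 2 == 1 else 1 for j in range(1, k + 1)]
    return a, b

a, b = e1_cf_coeffs(Fraction(2), 12)
print(cf_forward(a, b), cf_backward(a, b))
```

The forward version updates two sequences per step where the backward version updates one, which is the cost difference discussed next.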


The advantage of evaluating an infinite continued fraction such as (4.39) via the forward recurrence is that the cutoff $k$ need not be chosen in advance; we can stop when $|D_k|$ is sufficiently small, where
$$D_k = \frac{P_k}{Q_k} - \frac{P_{k-1}}{Q_{k-1}}. \tag{4.45}$$
The main disadvantage of the forward recurrence is that twice as many arithmetic operations are required as for the backward recurrence with the same value of $k$. Another disadvantage is that the forward recurrence may be less numerically stable than the backward recurrence.

10 Arithmetic operations on $\mathbb{C}$ are extended to $\hat{\mathbb{C}}$ in the obvious way, for example $1/0 = 1 + \infty = 1 \times \infty = \infty$, $1/\infty = 0$. Note that $0/0$, $0 \times \infty$ and $\infty \pm \infty$ are undefined.

If we are working with variable-precision floating-point arithmetic, which is 
much more expensive than single-precision floating-point, then a useful strat- 
egy is to use the forward recurrence with single-precision arithmetic (scaled to 
avoid overflow/underflow) to estimate k, and then use the backward recurrence 
with variable-precision arithmetic. One trick is needed: to evaluate $D_k$ using scaled single-precision we use the recurrence
$$D_1 = a_1/b_1, \qquad D_j = -a_j Q_{j-2} D_{j-1}/Q_j \quad (j = 2, 3, \ldots), \tag{4.46}$$
which avoids the cancellation inherent in (4.45).
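As an illustration of using (4.46) to choose the cutoff, here is a minimal sketch in ordinary double precision. The function name `cf_forward_adaptive`, the callables `a` and `b`, and the tolerance are assumptions made for this example; the scaling against overflow that the text mentions is omitted, so the sketch is only suitable for fractions whose $Q_j$ stay within floating-point range.

```python
def cf_forward_adaptive(a, b, tol=1e-15, max_terms=10_000):
    """Sum the differences D_j of successive approximants until |D_k| <= tol.

    a(j) and b(j) return the partial numerators and denominators for j >= 1.
    Returns (approximation to the continued fraction, number of terms k used).
    """
    q_prev, q = 1.0, b(1)            # Q_0, Q_1
    d = a(1) / b(1)                  # D_1 = a_1 / b_1
    y = d                            # P_1 / Q_1
    k = 1
    while abs(d) > tol and k < max_terms:
        k += 1
        q_new = b(k) * q + a(k) * q_prev      # Q_k from (4.44)
        d = -a(k) * q_prev * d / q_new        # D_k from (4.46)
        y += d                                # P_k/Q_k = P_{k-1}/Q_{k-1} + D_k
        q_prev, q = q, q_new
    return y, k

# Test case: all a_j = b_j = 1 gives 1/(1 + 1/(1 + ...)) = (sqrt(5) - 1)/2.
y, k = cf_forward_adaptive(lambda j: 1.0, lambda j: 1.0)
print(y, k)
```

The value of `k` found this way could then be passed to a backward evaluation in higher precision, which is the strategy described above.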
By analogy with the case of power series with decreasing terms that alternate in sign, there is one case in which it is possible to give a simple a posteriori bound for the error incurred in truncating a continued fraction. Let $f$ be a convergent continued fraction with approximants $f_k$ as in (4.41). Then:


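Theorem 4.1 If $a_j > 0$ and $b_j > 0$ for all $j \in \mathbb{N}^*$, then the sequence $(f_{2k})_{k \in \mathbb{N}}$ of even order approximants is strictly increasing, and the sequence $(f_{2k+1})_{k \in \mathbb{N}}$ of odd order approximants is strictly decreasing. Thus
$$f_{2k} < f < f_{2k+1}$$
and
$$\left| f - \frac{f_{m-1} + f_m}{2} \right| < \left| \frac{f_m - f_{m-1}}{2} \right|$$
for all $m \in \mathbb{N}^*$.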


In general, if the conditions of Theorem 4.1 are not satisfied, then it is diffi- 
cult to give simple, sharp error bounds. Power series and asymptotic series are 
usually much easier to analyse than continued fractions. 


4.7 Recurrence relations 


The evaluation of special functions by continued fractions is a special case of their evaluation by recurrence relations. To illustrate this, we consider the Bessel functions of the first kind, $J_\nu(x)$. Here $\nu$ and $x$ can in general be complex, but we restrict attention to the case $\nu \in \mathbb{Z}$, $x \in \mathbb{R}$. The functions $J_\nu(x)$ can be defined in several ways, for example by the generating function (elegant but only useful for $\nu \in \mathbb{Z}$)
$$\exp\left(\frac{x}{2}\left(t - \frac{1}{t}\right)\right) = \sum_{\nu=-\infty}^{+\infty} t^\nu J_\nu(x), \tag{4.47}$$
or by the power series (also valid if $\nu \notin \mathbb{Z}$):
$$J_\nu(x) = \left(\frac{x}{2}\right)^\nu \sum_{j=0}^{\infty} \frac{(-x^2/4)^j}{j!\,\Gamma(\nu + j + 1)}. \tag{4.48}$$
We also need Bessel functions of the second kind (sometimes called Neumann functions or Weber functions) $Y_\nu(x)$, which may be defined by
$$Y_\nu(x) = \lim_{\mu \to \nu} \frac{J_\mu(x)\cos(\pi\mu) - J_{-\mu}(x)}{\sin(\pi\mu)}. \tag{4.49}$$
Both $J_\nu(x)$ and $Y_\nu(x)$ are solutions of Bessel's differential equation
$$x^2 y'' + x y' + (x^2 - \nu^2)\, y = 0. \tag{4.50}$$


4.7.1 Evaluation of Bessel functions 


The Bessel functions $J_\nu(x)$ satisfy the recurrence relation
$$J_{\nu-1}(x) + J_{\nu+1}(x) = \frac{2\nu}{x} J_\nu(x). \tag{4.51}$$
Dividing both sides by $J_\nu(x)$, we see that
$$\frac{J_{\nu-1}(x)}{J_\nu(x)} = \frac{2\nu}{x} - 1 \Big/ \frac{J_\nu(x)}{J_{\nu+1}(x)},$$
which gives a continued fraction for the ratio $J_\nu(x)/J_{\nu-1}(x)$ ($\nu \ge 1$):
$$\frac{J_\nu(x)}{J_{\nu-1}(x)} = \frac{1}{2\nu/x\,-}\;\frac{1}{2(\nu+1)/x\,-}\;\frac{1}{2(\nu+2)/x\,-}\cdots \tag{4.52}$$
However, (4.52) is not immediately useful for evaluating the Bessel functions $J_0(x)$ or $J_1(x)$, as it only gives their ratio.

The recurrence (4.51) may be evaluated backwards by Miller's algorithm. The idea is to start at some sufficiently large index $\nu'$, take $f_{\nu'+1} = 0$, $f_{\nu'} = 1$, and evaluate the recurrence
$$f_{\nu-1} + f_{\nu+1} = \frac{2\nu}{x} f_\nu \tag{4.53}$$
backwards to obtain $f_{\nu'-1}, \ldots, f_0$. However, (4.53) is the same recurrence as (4.51), so we expect to obtain $f_0 \approx cJ_0(x)$, where $c$ is some scale factor. We can use the identity
$$J_0(x) + 2\sum_{\nu=1}^{\infty} J_{2\nu}(x) = 1 \tag{4.54}$$
to determine $c$.

To understand why Miller's algorithm works, and why evaluation of the recurrence (4.51) in the forward direction is numerically unstable for $\nu > x$, we observe that the recurrence (4.53) has two independent solutions: the desired solution $J_\nu(x)$, and an undesired solution $Y_\nu(x)$, where $Y_\nu(x)$ is a Bessel function of the second kind, see Eqn. (4.49). The general solution of the recurrence (4.53) is a linear combination of the special solutions $J_\nu(x)$ and $Y_\nu(x)$. Due to rounding errors, the computed solution will also be a linear combination, say $aJ_\nu(x) + bY_\nu(x)$. Since $|Y_\nu(x)|$ increases exponentially with $\nu$ when $\nu > ex/2$, but $|J_\nu(x)|$ is bounded, the unwanted component will increase exponentially if we use the recurrence in the forward direction, but decrease if we use it in the backward direction.

More precisely, we have
$$J_\nu(x) \sim \frac{1}{\sqrt{2\pi\nu}} \left(\frac{ex}{2\nu}\right)^\nu \quad\text{and}\quad Y_\nu(x) \sim -\sqrt{\frac{2}{\pi\nu}} \left(\frac{2\nu}{ex}\right)^\nu \tag{4.55}$$
as $\nu \to +\infty$ with $x$ fixed. Thus, when $\nu$ is large and greater than $ex/2$, $J_\nu(x)$ is small and $|Y_\nu(x)|$ is large.

Miller's algorithm seems to be the most effective method in the region where the power series (4.48) suffers from catastrophic cancellation, but asymptotic expansions are not sufficiently accurate. For more on Miller's algorithm, see §4.12.
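Miller's algorithm for $J_0(x)$ can be sketched in a few lines. This is an illustrative double-precision version under stated assumptions: the name `bessel_j0_miller` and the fixed starting index `start=40` are choices made here (a careful implementation would choose $\nu'$ from $x$ and the target precision, and would rescale to avoid overflow for large starting indices).

```python
def bessel_j0_miller(x, start=40):
    """Approximate J_0(x) for moderate x > 0 by backward recurrence (4.53),
    normalised with the identity (4.54)."""
    f_next, f = 0.0, 1.0             # f_{nu'+1} = 0, f_{nu'} = 1
    norm = 0.0                       # accumulates 2 * sum of f_{2 nu}, nu >= 1
    for nu in range(start, 0, -1):   # at loop entry, f = f_nu
        if nu % 2 == 0:
            norm += 2.0 * f
        # f_{nu-1} = (2 nu / x) f_nu - f_{nu+1}
        f_next, f = f, (2.0 * nu / x) * f - f_next
    norm += f                        # now f = f_0, so norm ~ c * 1 by (4.54)
    return f / norm                  # f_0 / c ~ J_0(x)

print(bessel_j0_miller(1.0))   # J_0(1) is approximately 0.7651976866
print(bessel_j0_miller(2.0))   # J_0(2) is approximately 0.2238907791
```

The same pass also yields scaled values of $J_1(x), J_2(x), \ldots$ if the intermediate $f_\nu$ are kept and divided by the same normalisation.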


4.7.2 Evaluation of Bernoulli and tangent numbers


In §4.5, Eqns. (4.35) and (4.38), the Bernoulli numbers $B_{2k}$ or scaled Bernoulli numbers $C_k = B_{2k}/(2k)!$ were required. These constants can be defined by the generating functions
$$\sum_{k=0}^{\infty} B_k \frac{x^k}{k!} = \frac{x}{e^x - 1} \tag{4.56}$$
and
$$\sum_{k=0}^{\infty} C_k x^{2k} = \frac{x}{e^x - 1} + \frac{x}{2} = \frac{x/2}{\tanh(x/2)}. \tag{4.57}$$
Multiplying both sides of (4.56) or (4.57) by $e^x - 1$, then equating coefficients, gives the recurrence relations
$$B_0 = 1, \qquad \sum_{j=0}^{k} \binom{k+1}{j} B_j = 0 \quad\text{for } k > 0, \tag{4.58}$$
and
$$\sum_{j=0}^{k} \frac{C_j}{(2k+1-2j)!} = \frac{1}{2\,(2k)!}. \tag{4.59}$$


These recurrences, or slight variants with similar numerical properties, have 
often been used to evaluate Bernoulli numbers. 

In this chapter our philosophy is that the required precision is not known in advance, so it is not possible to precompute the Bernoulli numbers and store them in a table once and for all. Thus, we need a good algorithm for computing them at runtime.

Unfortunately, forward evaluation of the recurrence (4.58), or the corresponding recurrence (4.59) for the scaled Bernoulli numbers, is numerically unstable: using precision $n$, the relative error in the computed $B_{2k}$ or $C_k$ is of order $4^k 2^{-n}$: see Exercise 4.35.

Despite its numerical instability, use of (4.59) may give the $C_k$ to acceptable accuracy if they are only needed to generate coefficients in an Euler–Maclaurin expansion where the successive terms diminish by at least a factor of four (or if the $C_k$ are computed using exact rational arithmetic). If the $C_k$ are required to precision $n$, then (4.59) should be used with sufficient guard digits, or (better) a more stable recurrence should be used. If we multiply both sides of (4.57) by $\sinh(x/2)/x$ and equate coefficients, we get the recurrence
$$\sum_{j=0}^{k} \frac{C_j}{(2k+1-2j)!\,4^{k-j}} = \frac{1}{(2k)!\,4^k}. \tag{4.60}$$


If (4.60) is used to evaluate $C_k$, using precision $n$ arithmetic, the relative error is only $O(k^2 2^{-n})$. Thus, use of (4.60) gives a stable algorithm for evaluating the scaled Bernoulli numbers $C_k$ (and hence, if desired, the Bernoulli numbers).
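Solving (4.60) for its last term gives $C_k$ in terms of $C_0, \ldots, C_{k-1}$. The sketch below does this in exact rational arithmetic, purely to make the recurrence concrete; the function name `scaled_bernoulli` is an assumption for this example, and the stability statement above concerns floating-point evaluation, which this exact version does not exercise.

```python
from fractions import Fraction
from math import factorial

def scaled_bernoulli(m):
    """Return [C_0, ..., C_m] with C_k = B_{2k}/(2k)!, using recurrence (4.60).

    The j = k term of (4.60) is C_k itself, so
    C_k = 1/((2k)! 4^k) - sum_{j<k} C_j / ((2k+1-2j)! 4^(k-j)).
    """
    c = []
    for k in range(m + 1):
        rhs = Fraction(1, factorial(2 * k) * 4 ** k)
        s = sum(c[j] / (factorial(2 * k + 1 - 2 * j) * 4 ** (k - j))
                for j in range(k))
        c.append(rhs - s)
    return c

print(scaled_bernoulli(3))   # C_0, C_1 = B_2/2!, C_2 = B_4/4!, C_3 = B_6/6!
```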

An even better, and perfectly stable, way to compute Bernoulli numbers is to exploit their relationship with the tangent numbers $T_j$, defined by
$$\tan x = \sum_{j \ge 1} T_j \frac{x^{2j-1}}{(2j-1)!}. \tag{4.61}$$
The tangent numbers are positive integers and can be expressed in terms of Bernoulli numbers:
$$T_j = (-1)^{j-1}\, 2^{2j} \left(2^{2j} - 1\right) \frac{B_{2j}}{2j}. \tag{4.62}$$
Conversely, the Bernoulli numbers can be expressed in terms of tangent numbers:
$$B_j = \begin{cases} 1 & \text{if } j = 0, \\ -1/2 & \text{if } j = 1, \\ (-1)^{j/2-1}\, j\, T_{j/2}/(4^j - 2^j) & \text{if } j > 0 \text{ is even}, \\ 0 & \text{otherwise}. \end{cases}$$
Eqn. (4.62) shows that the odd primes in the denominator of the Bernoulli number $B_{2j}$ must be divisors of $2^{2j} - 1$. In fact, this is a consequence of Fermat's little theorem and the Von Staudt–Clausen theorem, which says that the primes $p$ dividing the denominator of $B_{2j}$ are precisely those for which $(p-1) \mid 2j$ (see §4.12).

We now derive a recurrence that can be used to compute tangent numbers, using only integer arithmetic. For brevity, write $t = \tan x$ and $D = d/dx$. Then $Dt = \sec^2 x = 1 + t^2$. It follows that $D(t^n) = nt^{n-1}(1 + t^2)$ for all $n \in \mathbb{N}^*$.

It is clear that $D^n t$ is a polynomial in $t$, say $P_n(t)$. For example, $P_0(t) = t$, $P_1(t) = 1 + t^2$, etc. Write $P_n(t) = \sum_{j \ge 0} p_{n,j} t^j$. From the recurrence $P_n(t) = DP_{n-1}(t)$, and the formula for $D(t^n)$ just noted, we see that $\deg(P_n) = n + 1$ and
$$\sum_{j \ge 0} p_{n,j} t^j = \sum_{j \ge 0} j\, p_{n-1,j}\, t^{j-1} (1 + t^2),$$
so
$$p_{n,j} = (j-1)\, p_{n-1,j-1} + (j+1)\, p_{n-1,j+1} \tag{4.63}$$
for all $n \in \mathbb{N}^*$. Using (4.63), it is straightforward to compute the coefficients of the polynomials $P_1(t)$, $P_2(t)$, etc.

Observe that, since $\tan x$ is an odd function of $x$, the polynomials $P_{2k}(t)$ are odd, and the polynomials $P_{2k+1}(t)$ are even. Equivalently, $p_{n,j} = 0$ if $n + j$ is even.

We are interested in the tangent numbers $T_k = P_{2k-1}(0) = p_{2k-1,0}$. Using the recurrence (4.63) but avoiding computation of the coefficients that are known to vanish, we obtain Algorithm TangentNumbers for the in-place computation of tangent numbers. Note that this algorithm uses only arithmetic on non-negative integers. If implemented with single-precision integers, there may be problems with overflow as the tangent numbers grow rapidly. If implemented using floating-point arithmetic, it is numerically stable because there is no cancellation. An analogous algorithm SecantNumbers is the topic of Exercise 4.40.

Algorithm 4.3 TangentNumbers
Input: positive integer $m$
Output: Tangent numbers $T_1, \ldots, T_m$
$T_1 \leftarrow 1$
for $k$ from 2 to $m$ do
    $T_k \leftarrow (k-1)\,T_{k-1}$
for $k$ from 2 to $m$ do
    for $j$ from $k$ to $m$ do
        $T_j \leftarrow (j-k)\,T_{j-1} + (j-k+2)\,T_j$
return $T_1, T_2, \ldots, T_m$.
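A direct transcription of Algorithm TangentNumbers into Python, whose integers are arbitrary precision so that overflow is not a concern, might look as follows. The helper `bernoulli_from_tangent` is an addition for illustration; it applies the even case of the conversion formula given after (4.62) and is not part of the algorithm itself.

```python
from fractions import Fraction

def tangent_numbers(m):
    """Return [T_1, ..., T_m] following Algorithm TangentNumbers (in place)."""
    t = [0] * (m + 1)                 # t[0] is unused padding
    t[1] = 1
    for k in range(2, m + 1):
        t[k] = (k - 1) * t[k - 1]
    for k in range(2, m + 1):
        for j in range(k, m + 1):
            t[j] = (j - k) * t[j - 1] + (j - k + 2) * t[j]
    return t[1:]

def bernoulli_from_tangent(i, tangents):
    """Return B_{2i} (i >= 1) from the list [T_1, T_2, ...]."""
    j = 2 * i
    sign = -1 if (i - 1) % 2 else 1   # (-1)^(j/2 - 1)
    return Fraction(sign * j * tangents[i - 1], 4 ** j - 2 ** j)

ts = tangent_numbers(5)
print(ts)                                               # T_1 .. T_5
print([bernoulli_from_tangent(i, ts) for i in (1, 2, 3)])  # B_2, B_4, B_6
```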

The tangent numbers grow rapidly because the generating function $\tan x$ has poles at $x = \pm\pi/2$. Thus, we expect $T_k$ to grow roughly like $(2k-1)!\,(2/\pi)^{2k}$. More precisely,
$$\frac{T_k}{(2k-1)!} = \frac{2^{2k+1}\left(1 - 2^{-2k}\right)\zeta(2k)}{\pi^{2k}}, \tag{4.64}$$
where $\zeta(s)$ is the usual Riemann zeta-function, and
$$(1 - 2^{-s})\,\zeta(s) = 1 + 3^{-s} + 5^{-s} + \cdots$$
is sometimes called the odd zeta-function.

The Bernoulli numbers also grow rapidly, but not quite as fast as the tangent numbers, because the singularities of the generating function (4.56) are further from the origin (at $\pm 2i\pi$ instead of $\pm\pi/2$). It is well-known that the Riemann zeta-function for even non-negative integer arguments can be expressed in terms of Bernoulli numbers; the relation is
$$(-1)^{k-1} \frac{B_{2k}}{(2k)!} = \frac{2\zeta(2k)}{(2\pi)^{2k}}. \tag{4.65}$$
Since $\zeta(2k) = 1 + O(4^{-k})$ as $k \to +\infty$, we see that
$$|B_{2k}| \sim \frac{2\,(2k)!}{(2\pi)^{2k}}. \tag{4.66}$$
It is easy to see that (4.64) and (4.65) are equivalent, in view of the relation (4.62).

An asymptotically fast way of computing Bernoulli numbers is the topic of Exercise 4.41. For yet another way of computing Bernoulli numbers, using very little space, see §4.10.


4.8 Arithmetic-geometric mean 


The (theoretically) fastest known methods for very large precision $n$ use the arithmetic-geometric mean (AGM) iteration of Gauss and Legendre. The AGM is another non-linear recurrence, important enough to treat separately. Its complexity is $O(M(n) \ln n)$; the implicit constant here can be quite large, so other methods are better for small $n$.

Given $(a_0, b_0)$, the AGM iteration is defined by
$$(a_{j+1}, b_{j+1}) = \left(\frac{a_j + b_j}{2}, \sqrt{a_j b_j}\right).$$
For simplicity, we only consider real, positive starting values $(a_0, b_0)$ here (for complex starting values, see §4.8.5 and §4.12). The AGM iteration converges quadratically to a limit that we denote by $\mathrm{AGM}(a_0, b_0)$.

The AGM is useful because: 


1. It converges quadratically: eventually the number of correct digits doubles at each iteration, so only $O(\log n)$ iterations are required.

2. Each iteration takes time $O(M(n))$ because the square root can be computed in time $O(M(n))$ by Newton's method (see §3.5 and §4.2.3).

3. If we take suitable starting values $(a_0, b_0)$, the result $\mathrm{AGM}(a_0, b_0)$ can be used to compute logarithms (directly) and other elementary functions (less directly), as well as constants such as $\pi$ and $\ln 2$.


4.8.1 Elliptic integrals 


The theory of the AGM iteration is intimately linked to the theory of elliptic integrals. The complete elliptic integral of the first kind is defined by
$$K(k) = \int_0^{\pi/2} \frac{d\theta}{\sqrt{1 - k^2 \sin^2\theta}} = \int_0^1 \frac{dt}{\sqrt{(1 - t^2)(1 - k^2 t^2)}}, \tag{4.67}$$
and the complete elliptic integral of the second kind is
$$E(k) = \int_0^{\pi/2} \sqrt{1 - k^2 \sin^2\theta}\; d\theta = \int_0^1 \sqrt{\frac{1 - k^2 t^2}{1 - t^2}}\; dt,$$
where $k \in [0, 1]$ is called the modulus and $k' = \sqrt{1 - k^2}$ is the complementary modulus. It is traditional (though confusing as the prime does not denote differentiation) to write $K'(k)$ for $K(k')$ and $E'(k)$ for $E(k')$.


The connection with elliptic integrals. Gauss discovered that
$$\frac{1}{\mathrm{AGM}(1, k)} = \frac{2}{\pi} K'(k). \tag{4.68}$$
This identity can be used to compute the elliptic integral $K$ rapidly via the AGM iteration. We can also use it to compute logarithms. From the definition (4.67), we see that $K(k)$ has a series expansion that converges for $|k| < 1$ (in fact, $K(k) = (\pi/2)F(1/2, 1/2; 1; k^2)$ is a hypergeometric function). For small $k$, we have
$$K(k) = \frac{\pi}{2}\left(1 + \frac{k^2}{4} + O(k^4)\right). \tag{4.69}$$
It can also be shown that
$$K'(k) = \frac{2}{\pi} \ln\left(\frac{4}{k}\right) K(k) - \frac{k^2}{4} + O(k^4). \tag{4.70}$$


4.8.2 First AGM algorithm for the logarithm


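From the formulæ (4.68), (4.69), and (4.70), we easily get
$$\frac{\pi/2}{\mathrm{AGM}(1, k)} = \ln\left(\frac{4}{k}\right) \left(1 + O(k^2)\right). \tag{4.71}$$
Thus, if $x = 4/k$ is large, we have
$$\ln(x) = \frac{\pi/2}{\mathrm{AGM}(1, 4/x)} \left(1 + O\left(\frac{1}{x^2}\right)\right).$$
If $x \ge 2^{n/2}$, we can compute $\ln(x)$ to precision $n$ using the AGM iteration. It takes about $2\lg(n)$ iterations to converge if $x \in [2^{n/2}, 2^n]$.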
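The formula above is easy to try in double precision ($n = 53$), where $x \ge 2^{n/2}$ means roughly $x \ge 2^{27}$. The following is a minimal sketch; the names `agm` and `ln_agm`, the stopping tolerance, and the use of the library constant `math.pi` are assumptions for this illustration, not part of the text.

```python
import math

def agm(a, b, max_iter=64):
    """Arithmetic-geometric mean of two positive floats."""
    for _ in range(max_iter):
        if abs(a - b) <= 1e-15 * abs(a):
            break
        a, b = (a + b) / 2, math.sqrt(a * b)
    return a

def ln_agm(x):
    """ln(x) ~ (pi/2) / AGM(1, 4/x), with relative error O(1/x^2).

    Only accurate to double precision when x is at least about 2**27.
    """
    return math.pi / (2 * agm(1.0, 4.0 / x))

x = 2.0 ** 30
print(ln_agm(x), math.log(x))
```

For smaller arguments the approximation degrades quickly, which is what the argument expansion discussed next is meant to address.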

Note that we need the constant $\pi$, which could be computed by using our formula twice with slightly different arguments $x_1$ and $x_2$, then taking differences to approximate $(d\ln(x)/dx)/\pi$ at $x_1$ (see Exercise 4.44). More efficient is to use the Brent–Salamin (or Gauss–Legendre) algorithm, which is based on the AGM and the Legendre relation
$$EK' + E'K - KK' = \frac{\pi}{2}. \tag{4.72}$$

Argument expansion. If $x$ is not large enough, we can compute
$$\ln(2^\ell x) = \ell \ln 2 + \ln x$$
by the AGM method (assuming the constant $\ln 2$ is known). Alternatively, if $x > 1$, we can square $x$ enough times and compute
$$\ln\left(x^{2^\ell}\right) = 2^\ell \ln(x).$$
This method with $x = 2$ gives a way of computing $\ln 2$, assuming we already know $\pi$.


The error term. The $O(k^2)$ error term in the formula (4.71) is a nuisance. A rigorous bound is
$$\left| \frac{\pi/2}{\mathrm{AGM}(1, k)} - \ln\left(\frac{4}{k}\right) \right| \le 4k^2 (8 - \ln k) \tag{4.73}$$
for all $k \in (0, 1]$, and the bound can be sharpened to $0.37k^2(2.4 - \ln(k))$ if $k \in (0, 0.5]$.

The error $O(k^2 |\ln k|)$ makes it difficult to accelerate convergence by using a larger value of $k$ (i.e. a value of $x = 4/k$ smaller than $2^{n/2}$). There is an exact formula which is much more elegant and avoids this problem. Before giving this formula, we need to define some theta functions and show how they can be used to parameterize the AGM iteration.


4.8.3 Theta functions 
We need the theta functions $\theta_2(q)$, $\theta_3(q)$ and $\theta_4(q)$, defined for $|q| < 1$ by
$$\theta_2(q) = \sum_{n=-\infty}^{+\infty} q^{(n+1/2)^2} = 2q^{1/4} \sum_{n=0}^{+\infty} q^{n(n+1)}, \tag{4.74}$$
$$\theta_3(q) = \sum_{n=-\infty}^{+\infty} q^{n^2} = 1 + 2\sum_{n=1}^{+\infty} q^{n^2}, \tag{4.75}$$
$$\theta_4(q) = \theta_3(-q) = 1 + 2\sum_{n=1}^{+\infty} (-1)^n q^{n^2}. \tag{4.76}$$
Note that the defining power series are sparse, so it is easy to compute $\theta_2(q)$ and $\theta_3(q)$ for small $q$. Unfortunately, the rectangular splitting method of §4.4.3 does not help to speed up the computation.

The asymptotically fastest methods to compute theta functions use the AGM. However, we do not follow this trail, because it would lead us in circles! We want to use theta functions to give starting values for the AGM iteration.

Theta function identities. There are many classical identities involving theta functions. Two that are of interest to us are
$$\frac{\theta_3^2(q) + \theta_4^2(q)}{2} = \theta_3^2(q^2) \quad\text{and}\quad \theta_3(q)\,\theta_4(q) = \theta_4^2(q^2).$$
The latter may be written as
$$\sqrt{\theta_3^2(q)\,\theta_4^2(q)} = \theta_4^2(q^2)$$
to show the connection with the AGM:
$$\mathrm{AGM}(\theta_3^2(q), \theta_4^2(q)) = \mathrm{AGM}(\theta_3^2(q^2), \theta_4^2(q^2)) = \cdots = \mathrm{AGM}(\theta_3^2(q^{2^k}), \theta_4^2(q^{2^k})) = \cdots = 1$$
for any $|q| < 1$. (The limit is 1 because $q^{2^k}$ converges to 0, thus both $\theta_3$ and $\theta_4$ converge to 1.) Apart from scaling, the AGM iteration is parameterized by $(\theta_3^2(q^{2^k}), \theta_4^2(q^{2^k}))$ for $k = 0, 1, 2, \ldots$.
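These identities are easy to check numerically. The sketch below sums the sparse series (4.75) and (4.76) directly in double precision; the function names and the fixed number of series terms are assumptions chosen for illustration.

```python
import math

def theta3(q, terms=20):
    """theta_3(q) = 1 + 2 * sum_{n>=1} q^(n^2), truncated after `terms` - 1 terms."""
    return 1 + 2 * sum(q ** (n * n) for n in range(1, terms))

def theta4(q, terms=20):
    """theta_4(q) = theta_3(-q)."""
    return theta3(-q, terms)

def agm(a, b, max_iter=64):
    """Arithmetic-geometric mean of two positive floats."""
    for _ in range(max_iter):
        if abs(a - b) <= 1e-15 * abs(a):
            break
        a, b = (a + b) / 2, math.sqrt(a * b)
    return a

q = 0.3
# One AGM step maps (theta_3^2(q), theta_4^2(q)) to the same pair at q^2 ...
print((theta3(q) ** 2 + theta4(q) ** 2) / 2, theta3(q * q) ** 2)
# ... and the limit of the iteration is 1.
print(agm(theta3(q) ** 2, theta4(q) ** 2))
```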


The scaling factor. Since $\mathrm{AGM}(\theta_3^2(q), \theta_4^2(q)) = 1$, and $\mathrm{AGM}(\lambda a, \lambda b) = \lambda \cdot \mathrm{AGM}(a, b)$, scaling gives $\mathrm{AGM}(1, k') = 1/\theta_3^2(q)$ if $k' = \theta_4^2(q)/\theta_3^2(q)$. Equivalently, since $\theta_2^4 + \theta_4^4 = \theta_3^4$ (Jacobi), $k = \theta_2^2(q)/\theta_3^2(q)$. However, we know (from (4.68) with $k \to k'$) that $1/\mathrm{AGM}(1, k') = 2K(k)/\pi$, so
$$K(k) = \frac{\pi}{2}\,\theta_3^2(q). \tag{4.77}$$
Thus, the theta functions are closely related to elliptic integrals. In the literature $q$ is usually called the nome associated with the modulus $k$.


From q to k and k to q. We saw that $k = \theta_2^2(q)/\theta_3^2(q)$, which gives $k$ in terms of $q$. There is also a nice inverse formula which gives $q$ in terms of $k$: $q = \exp(-\pi K'(k)/K(k))$, or equivalently
$$\ln\left(\frac{1}{q}\right) = \frac{\pi K'(k)}{K(k)}. \tag{4.78}$$

Sasaki and Kanada's formula. Substituting (4.68) and (4.77) into (4.78) with $k = \theta_2^2(q)/\theta_3^2(q)$ gives Sasaki and Kanada's elegant formula
$$\ln\left(\frac{1}{q}\right) = \frac{\pi}{\mathrm{AGM}(\theta_2^2(q), \theta_3^2(q))}. \tag{4.79}$$


This leads to the following algorithm to compute $\ln x$.


4.8.4 Second AGM algorithm for the logarithm


Suppose $x$ is large. Let $q = 1/x$, compute $\theta_2(q^4)$ and $\theta_3(q^4)$ from their defining series (4.74) and (4.75), then compute $\mathrm{AGM}(\theta_2^2(q^4), \theta_3^2(q^4))$. Sasaki and Kanada's formula (with $q$ replaced by $q^4$ to avoid the $q^{1/4}$ term in the definition of $\theta_2(q)$) gives
$$\ln(x) = \frac{\pi/4}{\mathrm{AGM}(\theta_2^2(q^4), \theta_3^2(q^4))}.$$
There is a trade-off between increasing $x$ (by squaring or multiplication by a power of 2, see the paragraph on "Argument Expansion" in §4.8.2), and taking longer to compute $\theta_2(q^4)$ and $\theta_3(q^4)$ from their series. In practice, it seems good to increase $x$ until $q = 1/x$ is small enough that $O(q^{36})$ terms are negligible. Then we can use
$$\theta_2(q^4) = 2\left(q + q^9 + q^{25} + O(q^{49})\right),$$
$$\theta_3(q^4) = 1 + 2\left(q^4 + q^{16} + O(q^{36})\right).$$
We need $x \ge 2^{n/36}$, which is much better than the requirement $x \ge 2^{n/2}$ for the first AGM algorithm. We save about four AGM iterations at the cost of a few multiplications.
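In double precision ($n = 53$) the requirement $x \ge 2^{n/36}$ is met by any $x \ge 3$, so the second algorithm can be tried on small arguments directly. This is an illustrative sketch; the names `agm` and `ln_sasaki_kanada`, the stopping tolerance, and the use of `math.pi` are assumptions of the example.

```python
import math

def agm(a, b, max_iter=64):
    """Arithmetic-geometric mean of two positive floats."""
    for _ in range(max_iter):
        if abs(a - b) <= 1e-15 * abs(a):
            break
        a, b = (a + b) / 2, math.sqrt(a * b)
    return a

def ln_sasaki_kanada(x):
    """ln(x) = (pi/4) / AGM(theta_2^2(q^4), theta_3^2(q^4)) with q = 1/x.

    The theta series are truncated as in the text, so x should satisfy
    x >= 2**(n/36) for n-bit accuracy (x >= 3 is enough for doubles).
    """
    q = 1.0 / x
    theta2 = 2 * (q + q ** 9 + q ** 25)       # theta_2(q^4), O(q^49) dropped
    theta3 = 1 + 2 * (q ** 4 + q ** 16)       # theta_3(q^4), O(q^36) dropped
    return (math.pi / 4) / agm(theta2 ** 2, theta3 ** 2)

for x in (3.0, 8.0, 1000.0):
    print(x, ln_sasaki_kanada(x), math.log(x))
```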


Implementation notes. Since
$$\mathrm{AGM}(\theta_2^2, \theta_3^2) = \frac{\mathrm{AGM}(\theta_2^2 + \theta_3^2,\, 2\theta_2\theta_3)}{2},$$
we can avoid the first square root in the AGM iteration. Also, it only takes two non-scalar multiplications to compute $2\theta_2\theta_3$ and $\theta_2^2 + \theta_3^2$ from $\theta_2$ and $\theta_3$: see Exercise 4.45. Another speedup is possible by trading the multiplications for squares, see §4.12.


Drawbacks of the AGM. The AGM has three drawbacks: 


1. The AGM iteration is not self-correcting, so we have to work with full pre- 
cision (plus any necessary guard digits) throughout. In contrast, when us- 
ing Newton’s method or evaluating power series, many of the computations 
can be performed with reduced precision, which saves a log n factor (this 
amounts to using a negative number of guard digits). 

2. The AGM with real arguments gives $\ln(x)$ directly. To obtain $\exp(x)$, we need to apply Newton's method (§4.2.5 and Exercise 4.6). To evaluate trigonometric functions such as $\sin(x)$, $\cos(x)$, $\arctan(x)$, we need to work with complex arguments, which increases the constant hidden in the "O" time bound. Alternatively, we can use Landen transformations for incomplete elliptic integrals, but this gives even larger constants.

3. Because it converges so fast, it is difficult to speed up the AGM. At best we 
can save O(1) iterations (see however §4.12). 


4.8.5 The complex AGM 


In some cases, the asymptotically fastest algorithms require the use of complex 
arithmetic to produce a real result. It would be nice to avoid this because com- 
plex arithmetic is significantly slower than real arithmetic. Examples where we 
seem to need complex arithmetic to get the asymptotically fastest algorithms 
are: 


1. $\arctan(x)$, $\arcsin(x)$, $\arccos(x)$ via the AGM, using, for example, $\arctan(x) = \Im(\ln(1 + ix))$;
2. $\tan(x)$, $\sin(x)$, $\cos(x)$ using Newton's method and the above, or
$$\cos(x) + i\sin(x) = \exp(ix),$$
where the complex exponential is computed by Newton's method from the complex logarithm (see Eqn. (4.11)).


The theory that we outlined for the AGM iteration and AGM algorithms for $\ln(z)$ can be extended without problems to complex $z \notin (-\infty, 0]$, provided we always choose the square root with positive real part.

A complex multiplication takes three real multiplications (using Karatsuba's trick), and a complex squaring takes two real multiplications. We can do even better in the FFT domain, assuming that one multiplication of cost $M(n)$ is equivalent to three Fourier transforms. In this model, a squaring costs $2M(n)/3$. A complex multiplication $(a + ib)(c + id) = (ac - bd) + i(ad + bc)$ requires four forward and two backward transforms, and thus costs $2M(n)$. A complex squaring $(a + ib)^2 = (a + b)(a - b) + i(2ab)$ requires two forward and two backward transforms, and thus costs $4M(n)/3$. Taking this into account, we get the asymptotic upper bounds relative to the cost of one multiplication given in Table 4.1 (0.666 should be interpreted as $\sim 2M(n)/3$, and so on). See §4.12 for details of the algorithms giving these constants.
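The operation counts in the first sentence can be made concrete. The sketch below shows one standard way to arrange a complex product with three real multiplications and a complex square with two; the function names are hypothetical, and the FFT-domain costs discussed above are not modelled here.

```python
def complex_mul3(a, b, c, d):
    """Return (re, im) of (a + ib)(c + id) using three real multiplications."""
    t1 = a * c
    t2 = b * d
    t3 = (a + b) * (c + d)           # = ac + ad + bc + bd
    return t1 - t2, t3 - t1 - t2     # (ac - bd, ad + bc)

def complex_sqr2(a, b):
    """Return (re, im) of (a + ib)^2 using two real multiplications.

    The doubling of a*b is a scalar operation, not a full multiplication.
    """
    return (a + b) * (a - b), 2 * (a * b)

print(complex_mul3(3, 4, 5, 6))   # (3 + 4i)(5 + 6i)
print(complex_sqr2(3, 4))         # (3 + 4i)^2
```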


Operation        real          complex
squaring         0.666         1.333
multiplication   1.000         2.000
reciprocal       1.444         3.444
division         1.666         4.777
square root      1.333         5.333
AGM iteration    2.000         6.666
log via AGM      4.000 lg n    13.333 lg n

Table 4.1 Costs in the FFT domain.


4.9 Binary splitting


Since the asymptotically fastest algorithms for arctan, sin, cos, etc. have a large constant hidden in their time bound $O(M(n) \log n)$ (see "Drawbacks of the AGM", §4.8.4), it is interesting to look for other algorithms that may be competitive for a large range of precisions, even if not asymptotically optimal. One such algorithm (or class of algorithms) is based on binary splitting (see §4.12). The time complexity of these algorithms is usually
$$O((\log n)^\alpha M(n))$$
for some constant $\alpha \ge 1$ depending on how fast the relevant power series converges, and also on the multiplication algorithm (classical, Karatsuba, or quasi-linear).


The idea. Suppose we want to compute $\arctan(x)$ for rational $x = p/q$, where $p$ and $q$ are small integers and $|x| < 1/2$. The Taylor series gives
$$\arctan\left(\frac{p}{q}\right) \approx \sum_{j=0}^{n/2} \frac{(-1)^j\, p^{2j+1}}{(2j+1)\, q^{2j+1}}.$$
The finite sum, if computed exactly, gives a rational approximation $P/Q$ to $\arctan(p/q)$, and
$$\log |Q| = O(n \log n).$$
(Note: the series for exp converges faster, so in this case we sum $\sim n/\ln n$ terms and get $\log |Q| = O(n)$.)

The finite sum can be computed by the "divide and conquer" strategy: sum the first half to get $P_1/Q_1$ say, and the second half to get $P_2/Q_2$, then
$$\frac{P}{Q} = \frac{P_1}{Q_1} + \frac{P_2}{Q_2} = \frac{P_1 Q_2 + P_2 Q_1}{Q_1 Q_2}.$$
The rationals $P_1/Q_1$ and $P_2/Q_2$ are computed by a recursive application of the same method, hence the term "binary splitting". If used with quadratic multiplication, this way of computing $P/Q$ does not help; however, fast multiplication speeds up the balanced products $P_1Q_2$, $P_2Q_1$, and $Q_1Q_2$.
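The divide and conquer combination can be written down almost verbatim. The sketch below is deliberately naive and is not an optimised implementation: each leaf recomputes its own powers of $p$ and $q$, and no common factors are removed, so $Q$ is larger than necessary. The function name `arctan_sum` and the half-open index range are choices made for this illustration.

```python
from fractions import Fraction
import math

def arctan_sum(p, q, lo, hi):
    """Return integers (P, Q) with
    P/Q = sum_{j=lo}^{hi-1} (-1)^j p^(2j+1) / ((2j+1) q^(2j+1)),
    computed by splitting the index range [lo, hi) in half recursively."""
    if hi - lo == 1:
        sign = -1 if lo % 2 else 1
        return sign * p ** (2 * lo + 1), (2 * lo + 1) * q ** (2 * lo + 1)
    mid = (lo + hi) // 2
    p1, q1 = arctan_sum(p, q, lo, mid)      # first half:  P1/Q1
    p2, q2 = arctan_sum(p, q, mid, hi)      # second half: P2/Q2
    return p1 * q2 + p2 * q1, q1 * q2       # P/Q = (P1 Q2 + P2 Q1)/(Q1 Q2)

P, Q = arctan_sum(1, 5, 0, 16)
print(float(Fraction(P, Q)), math.atan(0.2))
```

With Python's built-in integers the three products at each level are balanced, which is the situation in which fast multiplication pays off.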


Complexity. The overall time complexity is
$$O\left(\sum_{k=1}^{\lceil \lg(n) \rceil} 2^k M\left(2^{-k} n \log n\right)\right) = O\left((\log n)^\alpha M(n)\right), \tag{4.80}$$
where $\alpha = 2$ in the FFT range; in general $\alpha \le 2$ (see Exercise 4.47).

We can save a little by working to precision $n$ rather than $n \log n$ at the top levels; but we still have $\alpha = 2$ for quasi-linear multiplication.

In practice, the multiplication algorithm would not be fixed but would depend on the size of the integers being multiplied. The complexity would depend on the algorithm(s) used at the top levels.


Repeated application of the idea. If $x \in (0, 0.25)$ and we want to compute $\arctan(x)$, we can approximate $x$ by a rational $p/q$ and compute $\arctan(p/q)$ as a first approximation to $\arctan(x)$, say $p/q \le x < (p+1)/q$. Now, from (4.17),
$$\tan(\arctan(x) - \arctan(p/q)) = \frac{x - p/q}{1 + px/q},$$
so
$$\arctan(x) = \arctan(p/q) + \arctan(\delta),$$
where
$$\delta = \frac{x - p/q}{1 + px/q} = \frac{qx - p}{q + px}.$$
We can apply the same idea to approximate $\arctan(\delta)$. Eventually we get a sufficiently accurate approximation to $\arctan(x)$. Since $|\delta| < |x - p/q| < 1/q$, it is easy to ensure that the process converges.


Complexity of repeated application. If we use a sequence of about $\lg n$ rationals $p_1/q_1, p_2/q_2, \ldots$, where
$$q_i = 2^{2^i},$$
then the computation of each $\arctan(p_i/q_i)$ takes time $O((\log n)^\alpha M(n))$, and the overall time to compute $\arctan(x)$ is
$$O((\log n)^{\alpha+1} M(n)).$$
Indeed, we have $0 \le p_i < 2^{2^{i-1}}$; thus, $p_i$ has at most $2^{i-1}$ bits, and $p_i/q_i$ as a rational has value $O(2^{-2^{i-1}})$ and size $O(2^i)$. The exponent $\alpha + 1$ is 2 or 3. Although this is not asymptotically as fast as AGM-based algorithms, the implicit constants for binary splitting are small and the idea is useful for quite large $n$ (at least $10^6$ decimal places).


Generalizations. The idea of binary splitting can be generalized. For example, the Chudnovsky brothers gave a "bit-burst" algorithm, which applies to fast evaluation of solutions of linear differential equations. This is described in §4.9.2.


4.9.1 A binary splitting algorithm for sin, cos 


Brent [45, Theorem 6.2] claims an $O(M(n) \log^2 n)$ algorithm for $\exp x$ and $\sin x$; however, the proof only covers the case of the exponential and ends with "the proof of (6.28) is similar". He had in mind deducing $\sin x$ from a complex computation of $\exp(ix) = \cos x + i \sin x$. Algorithm SinCos is a variation of Brent's algorithm for $\exp x$ that computes $\sin x$ and $\cos x$ simultaneously, in a way that avoids computations with complex numbers. The simultaneous computation of $\sin x$ and $\cos x$ might be useful to compute $\tan x$ or a plane rotation through the angle $x$.


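Algorithm 4.4 SinCos
Input: floating-point $0 < x < 1/2$, integer $n$
Output: an approximation of $\sin x$ and $\cos x$ with error $O(2^{-n})$
1: write $x \approx \sum_{i=0}^{k} p_i \cdot 2^{-2^{i+1}}$ where $0 \le p_i < 2^{2^i}$ and $k = \lceil \lg n \rceil - 1$
2: let $x_j = \sum_{i=j}^{k} p_i \cdot 2^{-2^{i+1}}$, with $x_{k+1} = 0$, and $y_j = p_j \cdot 2^{-2^{j+1}}$
3: $(S_{k+1}, C_{k+1}) \leftarrow (0, 1)$    ▷ $S_j$ is $\sin x_j$ and $C_j$ is $\cos x_j$
4: for $j$ from $k$ downto 0 do
5:     compute $\sin y_j$ and $\cos y_j$ using binary splitting
6:     $S_j \leftarrow \sin y_j \cdot C_{j+1} + \cos y_j \cdot S_{j+1}$, $C_j \leftarrow \cos y_j \cdot C_{j+1} - \sin y_j \cdot S_{j+1}$
7: return $(S_0, C_0)$.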
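The structure of Algorithm SinCos can be sketched as follows. This is a simplified illustration under loudly stated assumptions: exact rational arithmetic (`Fraction`) stands in for $n$-bit truncated arithmetic, and step 5 uses a plain term-by-term Taylor sum (the hypothetical helper `taylor_sincos`) rather than binary splitting, so the sketch shows the bit decomposition and the recombination but not the complexity of the real algorithm.

```python
from fractions import Fraction
import math

def taylor_sincos(y, n):
    """Exact partial Taylor sums of sin y and cos y, stopping once a term
    y^m/m! drops below 2^-(n+2). Stand-in for binary splitting (step 5)."""
    sin_y, cos_y = Fraction(0), Fraction(0)
    term, m = Fraction(1), 0                  # term = y^m / m!
    bound = Fraction(1, 2 ** (n + 2))
    while term >= bound:
        sign = -1 if (m // 2) % 2 else 1
        if m % 2:
            sin_y += sign * term
        else:
            cos_y += sign * term
        m += 1
        term = term * y / m
    return sin_y, cos_y

def sincos(x, n=64):
    """Sketch of Algorithm SinCos for 0 < x < 1/2, to roughly n bits."""
    rem = Fraction(x)
    k = max(0, math.ceil(math.log2(n)) - 1)
    ps = []
    for i in range(k + 1):                    # step 1: 0 <= p_i < 2^(2^i)
        scale = 2 ** (2 ** (i + 1))
        p = math.floor(rem * scale)
        ps.append(p)
        rem -= Fraction(p, scale)
    s, c = Fraction(0), Fraction(1)           # step 3: (S_{k+1}, C_{k+1})
    for j in range(k, -1, -1):                # step 4
        y = Fraction(ps[j], 2 ** (2 ** (j + 1)))
        sy, cy = taylor_sincos(y, n)          # step 5 (simplified)
        s, c = sy * c + cy * s, cy * c - sy * s   # step 6
    return float(s), float(c)

print(sincos(0.3), (math.sin(0.3), math.cos(0.3)))
```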


At step 2 of Algorithm SinCos, we have $x_j = y_j + x_{j+1}$; thus, $\sin x_j = \sin y_j \cos x_{j+1} + \cos y_j \sin x_{j+1}$, and similarly for $\cos x_j$, explaining the formulæ used at step 6. Step 5 uses a binary splitting algorithm similar to the one described above for $\arctan(p/q)$: $y_j$ is a small rational, or is small itself, so that all needed powers do not exceed $n$ bits in size. This algorithm has the same complexity $O(M(n) \log^2 n)$ as Brent's algorithm for $\exp x$.


4.9.2 The bit-burst algorithm


The binary-splitting algorithms described above for $\arctan x$, $\exp x$, $\sin x$ rely on a functional equation: $\tan(x + y) = (\tan x + \tan y)/(1 - \tan x \tan y)$, $\exp(x + y) = \exp(x)\exp(y)$, $\sin(x + y) = \sin x \cos y + \sin y \cos x$. We describe here a more general algorithm, known as the "bit-burst" algorithm, which does not require such a functional equation. This algorithm applies to a class of functions known as holonomic functions. Other names are differentiably finite and D-finite.

A function $f(x)$ is said to be holonomic iff it satisfies a linear homogeneous differential equation with polynomial coefficients in $x$. Equivalently, the Taylor coefficients $u_k$ of $f$ satisfy a linear homogeneous recurrence with coefficients polynomial in $k$. The set of holonomic functions is closed under the operations of addition and multiplication, but not necessarily under division. For example, the exp, ln, sin, cos functions are holonomic, but tan is not.

An important subclass of holonomic functions is the hypergeometric functions, whose Taylor coefficients satisfy a recurrence $u_{k+1}/u_k = R(k)$, where $R(k)$ is a rational function of $k$ (see §4.4). This matches the second definition above, because we can write it as $u_{k+1}Q(k) - u_kP(k) = 0$ if $R(k) = P(k)/Q(k)$. Holonomic functions are much more general than hypergeometric functions (see Exercise 4.48); in particular, the ratio of two consecutive terms in a hypergeometric series has size $O(\log k)$ (as a rational number), but can be much larger for holonomic functions.


Theorem 4.2 If $f$ is holonomic and has no singularities on a finite, closed interval $[A, B]$, where $A < 0 < B$ and $f(0) = 0$, then $f(x)$ can be computed to an (absolute) accuracy of $n$ bits, for any $n$-bit floating-point number $x \in (A, B)$, in time $O(M(n) \log^3 n)$.


NOTES: For a sharper result, see Exercise 4.49. The condition $f(0) = 0$ is just a technical condition to simplify the proof of the theorem; $f(0)$ can be any value that can be computed to $n$ bits in time $O(M(n) \log^3 n)$.


Proof. Without loss of generality, we assume $0 \le x < 1 < B$; the binary expansion of $x$ can then be written $x = 0.b_1b_2 \ldots b_n$. Define $r_1 = 0.b_1$, $r_2 = 0.0b_2b_3$, $r_3 = 0.000b_4b_5b_6b_7$ (the same decomposition was already used in Algorithm SinCos): $r_1$ consists of the first bit of the binary expansion of $x$, $r_2$ consists of the next two bits, $r_3$ the next four bits, and so on. Thus, we have $x = r_1 + r_2 + \cdots + r_k$, where $2^{k-1} \le n < 2^k$.

Define $x_i = r_1 + \cdots + r_i$ with $x_0 = 0$. The idea of the algorithm is to translate the Taylor series of $f$ from $x_i$ to $x_{i+1}$; since $f$ is holonomic, this reduces to translating the recurrence on the corresponding coefficients. The condition that $f$ has no singularity in $[0, x] \subset [A, B]$ ensures that the translated recurrence is well-defined. We define $f_0(t) = f(t)$, $f_1(t) = f_0(r_1 + t)$, $f_2(t) = f_1(r_2 + t)$, ..., $f_i(t) = f_{i-1}(r_i + t)$ for $i \le k$. We have $f_i(t) = f(x_i + t)$, and $f_k(t) = f(x + t)$ since $x_k = x$. Thus, we are looking for $f_k(0) = f(x)$.

Let $f_i^*(t) = f_i(t) - f_i(0)$ be the non-constant part of the Taylor expansion of $f_i$. We have $f_i^*(r_{i+1}) = f_i(r_{i+1}) - f_i(0) = f_{i+1}(0) - f_i(0)$ because $f_{i+1}(t) = f_i(r_{i+1} + t)$. Thus
$$\begin{aligned} f_0^*(r_1) + \cdots + f_{k-1}^*(r_k) &= (f_1(0) - f_0(0)) + \cdots + (f_k(0) - f_{k-1}(0)) \\ &= f_k(0) - f_0(0) = f(x) - f(0). \end{aligned}$$
Since $f(0) = 0$, this gives
$$f(x) = \sum_{i=0}^{k-1} f_i^*(r_{i+1}).$$
To conclude the proof, we will show that each term $f_i^*(r_{i+1})$ can be evaluated to $n$ bits in time $O(M(n) \log^2 n)$. The rational $r_{i+1}$ has a numerator of at most $2^i$ bits, and
$$0 \le r_{i+1} < 2^{1-2^i}.$$
Thus, to evaluate $f_i^*(r_{i+1})$ to $n$ bits, $n/2^i + O(\log n)$ terms of the Taylor expansion of $f_i(t)$ are enough. We now use the fact that $f$ is holonomic. Assume $f$ satisfies the following homogeneous linear$^{11}$ differential equation with polynomial coefficients
$$c_m(t) f^{(m)}(t) + \cdots + c_1(t) f'(t) + c_0(t) f(t) = 0.$$
Substituting $x_i + t$ for $t$, we obtain a differential equation for $f_i$
$$c_m(x_i + t) f_i^{(m)}(t) + \cdots + c_1(x_i + t) f_i'(t) + c_0(x_i + t) f_i(t) = 0.$$
From this equation, we deduce (see §4.12) a linear recurrence for the Taylor coefficients of $f_i(t)$, of the same order as that for $f(t)$. The coefficients in the recurrence for $f_i(t)$ have $O(2^i)$ bits, since $x_i = r_1 + \cdots + r_i$ has $O(2^i)$ bits. It follows that the $\ell$th Taylor coefficient of $f_i(t)$ has size $O(\ell(2^i + \log \ell))$. The $\ell \log \ell$ term comes from the polynomials in $\ell$ in the recurrence. Since $\ell \le n/2^i + O(\log n)$, this is $O(n \log n)$.

However, we do not want to evaluate the $\ell$th Taylor coefficient $u_\ell$ of $f_i(t)$, but the series
$$s_\ell = \sum_{j=1}^{\ell} u_j r_{i+1}^j \approx f_i^*(r_{i+1}).$$
Noting that $u_\ell = (s_\ell - s_{\ell-1})/r_{i+1}^\ell$, and substituting this value in the recurrence for $(u_\ell)$, say of order $d$, we obtain a recurrence of order $d + 1$ for $(s_\ell)$. Putting this latter recurrence in matrix form $S_\ell = M_\ell S_{\ell-1}$, where $S_\ell$ is the vector $(s_\ell, s_{\ell-1}, \ldots, s_{\ell-d})$, we obtain
$$S_\ell = M_\ell M_{\ell-1} \cdots M_{d+1} S_d, \tag{4.81}$$
where the matrix product $M_\ell M_{\ell-1} \cdots M_{d+1}$ can be evaluated in time $O(M(n) \log^2 n)$ using binary splitting.

11 If $f$ satisfies a non-homogeneous differential equation, say $E(t, f(t), f'(t), \ldots, f^{(k)}(t)) = b(t)$, where $b(t)$ is polynomial in $t$, differentiating it yields $F(t, f(t), f'(t), \ldots, f^{(k+1)}(t)) = b'(t)$, and $b'(t)E(\cdot) - b(t)F(\cdot)$ is homogeneous.


We illustrate Theorem 4.2 with the arc-tangent function, which satisfies the differential equation $f'(t)(1 + t^2) = 1$. This equation evaluates at $x_i + t$ to $f_i'(t)(1 + (x_i + t)^2) = 1$, where $f_i(t) = f(x_i + t)$. This gives the recurrence

$$(1 + x_i^2)\,\ell u_\ell + 2x_i(\ell-1)u_{\ell-1} + (\ell-2)u_{\ell-2} = 0$$

for the Taylor coefficients $u_\ell$ of $f_i$. This recurrence translates to

$$(1 + x_i^2)\,\ell v_\ell + 2x_i r_{i+1}(\ell-1)v_{\ell-1} + r_{i+1}^2(\ell-2)v_{\ell-2} = 0$$

for $v_\ell = u_\ell r_{i+1}^\ell$, and to

$$(1 + x_i^2)\,\ell(s_\ell - s_{\ell-1}) + 2x_i r_{i+1}(\ell-1)(s_{\ell-1} - s_{\ell-2}) + r_{i+1}^2(\ell-2)(s_{\ell-2} - s_{\ell-3}) = 0$$

for $s_\ell = \sum_{j=1}^{\ell} v_j$. This recurrence of order 3 can be written in matrix form, and Eqn. (4.81) enables us to efficiently compute $s_\ell \approx f_i(r_{i+1}) - f_i(0)$ using multiplication of 3 × 3 matrices and fast integer multiplication.
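
The decomposition used in the proof can be tried out on the arc-tangent example. The following Python sketch is only an illustration under stated assumptions: it splits x into chunks r_1, r_2, ... of 1, 2, 4, ... bits and sums each Taylor series term by term with exact rationals, using the three-term recurrence for the u_l. It does not use binary splitting of the 3 × 3 matrix product, so it does not achieve the complexity of Theorem 4.2. The function names and the conservative term count are hypothetical choices made for this sketch.

```python
from fractions import Fraction
import math


def bit_burst_chunks(X, n):
    """Split x = X / 2**n (with 0 <= X < 2**n) as r_1 + r_2 + ...,
    where r_{i+1} carries the next 2**i bits of the binary expansion."""
    chunks, pos, width = [], 0, 1
    while pos < n:
        hi = min(pos + width, n)
        part = (X >> (n - hi)) & ((1 << (hi - pos)) - 1)  # bits pos+1..hi
        chunks.append(Fraction(part, 1 << hi))
        pos, width = hi, 2 * width
    return chunks


def atan_increment(x_i, r, terms):
    """Sum u_1 r + ... + u_terms r^terms, an approximation to
    atan(x_i + r) - atan(x_i), using the three-term recurrence for u_l."""
    c = 1 + x_i * x_i
    u_prev2, u_prev1 = Fraction(0), 1 / c  # u_0 is unused; u_1 = 1/(1+x_i^2)
    power = r
    s = u_prev1 * power
    for l in range(2, terms + 1):
        u = -(2 * x_i * (l - 1) * u_prev1 + (l - 2) * u_prev2) / (c * l)
        power *= r
        s += u * power
        u_prev2, u_prev1 = u_prev1, u
    return s


def atan_bit_burst(X, n):
    """Approximate atan(X / 2**n) to roughly n bits (exact rational result)."""
    x_i, total = Fraction(0), Fraction(0)
    for i, r in enumerate(bit_burst_chunks(X, n)):
        terms = n // max(1, 2**i - 1) + 2  # conservative term count (assumption)
        total += atan_increment(x_i, r, terms)
        x_i += r
    return total
```

For a 48-bit argument the chunks have 1, 2, 4, 8, 16 and 17 bits, and the later chunks need only a handful of Taylor terms, which is the effect the proof exploits.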


4.10 Contour integration 


In this section, we assume that facilities for arbitrary-precision complex arithmetic are available. These can be built on top of an arbitrary-precision real arithmetic package (see Chapters 3 and 5).

Let f(z) be holomorphic in the disc |z| < R, R > 1, and let the power 
series for f be 


$$f(z) = \sum_{j=0}^{\infty} a_j z^j. \qquad (4.82)$$

From Cauchy's theorem [122, Ch. 7], we have

$$a_j = \frac{1}{2\pi i} \int_C \frac{f(z)}{z^{j+1}}\,dz, \qquad (4.83)$$

where C is the unit circle. The contour integral in (4.83) may be approximated numerically by sums

$$S_{j,k} = \frac{1}{k} \sum_{m=0}^{k-1} f(e^{2\pi i m/k})\, e^{-2\pi i jm/k}. \qquad (4.84)$$

Let C' be a circle with centre at the origin and radius $\rho \in (1, R)$. From Cauchy's theorem, assuming that j < k, we have (see Exercise 4.50)

$$S_{j,k} - a_j = \frac{1}{2\pi i} \int_{C'} \frac{f(z)}{(z^k - 1) z^{j+1}}\,dz = a_{j+k} + a_{j+2k} + \cdots, \qquad (4.85)$$

so $|S_{j,k} - a_j| = O((R - \delta)^{-(j+k)})$ as $k \to \infty$, for any $\delta > 0$. For example, let

$$f(z) = \frac{z}{e^z - 1} + \frac{z}{2} \qquad (4.86)$$

be the generating function for the scaled Bernoulli numbers as in (4.57), so $a_{2j} = C_j = B_{2j}/(2j)!$ and $R = 2\pi$ (because of the poles at $\pm 2\pi i$). Then

$$S_{2j,k} - \frac{B_{2j}}{(2j)!} = \frac{B_{2j+k}}{(2j+k)!} + \frac{B_{2j+2k}}{(2j+2k)!} + \cdots, \qquad (4.87)$$

so we can evaluate $B_{2j}$ with relative error $O((2\pi)^{-k})$ by evaluating f(z) at k points on the unit circle.

There is some cancellation when using (4.84) to evaluate $S_{2j,k}$, because the terms in the sum are of order unity but the result is of order $(2\pi)^{-2j}$. Thus, O(j) guard digits are needed. In the following, we assume j = O(n).
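
As a small numerical illustration of (4.84) with the function (4.86), the following Python sketch evaluates S_{j,k} in ordinary double-precision complex arithmetic. This is an assumption made only for the sketch: the text has arbitrary-precision complex arithmetic in mind. The names `f` and `contour_sum` are hypothetical.

```python
import cmath
import math


def f(z):
    """The function (4.86): z/(e^z - 1) + z/2."""
    return z / (cmath.exp(z) - 1) + z / 2


def contour_sum(j, k):
    """S_{j,k} of (4.84): a k-point approximation to the coefficient a_j."""
    w = cmath.exp(complex(0, 2 * math.pi / k))  # primitive k-th root of unity
    total = 0
    for m in range(k):
        z = w ** m
        total += f(z) * z ** (-j)
    return (total / k).real  # a_j is real here, so discard rounding noise
```

With k = 32 points, `contour_sum(2, 32)` should agree with B_2/2! = 1/12 and `contour_sum(4, 32)` with B_4/4! = -1/720 to roughly double precision, since the aliased terms in (4.87) are far below the rounding level.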

If $\exp(-2\pi i jm/k)$ is computed efficiently from $\exp(-2\pi i/k)$ in the obvious way, the time required to evaluate $B_2, \ldots, B_{2j}$ to precision n is O(jnM(n)), and the space required is O(n). We assume here that we need all Bernoulli numbers up to index 2j, but we do not need to store all of them simultaneously. This is the case if we are using the Bernoulli numbers as coefficients in a sum such as (4.38).

The recurrence relation method of §4.7.2 is faster but requires space O(jn). Thus, the method of contour integration has advantages if space is critical.

For comments on other forms of numerical quadrature, see §4.12. 
4.11 Exercises


Exercise 4.1 If $A(x) = \sum_{j \ge 0} a_j x^j$ is a formal power series over R with $a_0 = 1$, show that ln(A(x)) can be computed with error $O(x^n)$ in time O(M(n)), where M(n) is the time required to multiply two polynomials of degree n - 1. Assume a reasonable smoothness condition on the growth of M(n) as a function of n. [Hint: (d/dx) ln(A(x)) = A'(x)/A(x).] Does a similar result hold for n-bit numbers if x is replaced by 1/2?

Exercise 4.2 (Schönhage [197] and Schost) Assume we want to compute $1/s(x) \bmod x^n$, for s(x) a power series. Design an algorithm using an odd-even scheme (§1.3.5), and estimate its complexity in the FFT range.


Exercise 4.3 Suppose that g and h are sufficiently smooth functions satisfying g(h(x)) = x on some interval. Let $y_j = h(x_j)$. Show that the iteration

$$x_{j+1} = x_j + \sum_{m=1}^{k-1} (y - y_j)^m \frac{g^{(m)}(y_j)}{m!}$$

is a kth-order iteration that (under suitable conditions) will converge to x = g(y). [Hint: generalize the argument leading to (4.16).]


Exercise 4.4 Design a Horner-like algorithm for evaluating a series $\sum_{j=0}^{k} a_j x^j$ in the forward direction, while deciding dynamically where to stop. For the stopping criterion, assume that the $|a_j|$ are monotonic decreasing and that |x| < 1/2. [Hint: use y = 1/x.]

Exercise 4.5 Assume we want n bits of exp x for x of order $2^j$, with the repeated use of the doubling formula (§4.3.1), and the naive method to evaluate power series. What is the best reduced argument $x/2^k$ in terms of n and j? [Consider both cases $j \ge 0$ and j < 0.]


Exercise 4.6 Assuming we can compute an n-bit approximation to ln x in time T(n), where $n \ll M(n) = o(T(n))$, show how to compute an n-bit approximation to exp x in time $\sim T(n)$. Assume that T(n) and M(n) satisfy reasonable smoothness conditions.


Exercise 4.7 Care has to be taken to use enough guard digits when computing exp(x) by argument reduction followed by the power series (4.21). If x is of order unity and k steps of argument reduction are used to compute exp(x) via

$$\exp(x) = \left(\exp(x/2^k)\right)^{2^k},$$

show that about k bits of precision will be lost (so it is necessary to use about k guard bits).
Exercise 4.8 Show that the problem analysed in Exercise 4.7 can be avoided if we work with the function

$$\mathrm{expm1}(x) = \exp(x) - 1 = \sum_{j=1}^{\infty} \frac{x^j}{j!},$$

which satisfies the doubling formula expm1(2x) = expm1(x)(2 + expm1(x)).


Exercise 4.9 For x > -1, prove the reduction formula

$$\mathrm{log1p}(x) = 2\,\mathrm{log1p}\left(\frac{x}{1 + \sqrt{1 + x}}\right),$$

where the function log1p(x) is defined by log1p(x) = ln(1 + x), as in §4.4.2. Explain why it might be desirable to work with log1p instead of ln in order to avoid loss of precision (in the argument reduction, rather than in the reconstruction as in Exercise 4.7). Note however that argument reduction for log1p is more expensive than that for expm1, because of the square root.


Exercise 4.10 Give a numerically stable way of computing sinh(x) using one evaluation of expm1(|x|) and a small number of additional operations (compare Eqn. (4.20)).


Exercise 4.11 (White) Show that exp(x) can be computed via sinh(x) using the formula

$$\exp(x) = \sinh(x) + \sqrt{1 + \sinh^2(x)}.$$

Since

$$\sinh(x) = \frac{e^x - e^{-x}}{2} = \sum_{k \ge 0} \frac{x^{2k+1}}{(2k+1)!},$$

this saves computing about half the terms in the power series for exp(x) at the expense of one square root. How can we modify this method to preserve numerical stability for negative arguments x? Can this idea be used for functions other than exp(x)?


Exercise 4.12 Count precisely the number of non-scalar products necessary for the two variants of rectangular series splitting (§4.4.3).


Exercise 4.13 A drawback of rectangular series splitting as presented in §4.4.3 is that the coefficients ($a_{k\ell+m}$ in the classical splitting, or $a_{jm+\ell}$ in the modular splitting) involved in the scalar multiplications might become large. Indeed, they are typically a product of factorials, and thus have size O(d log d). Assuming that the ratios $a_{i+1}/a_i$ are small rationals, propose an alternate way of evaluating P(x).
Exercise 4.14 Make explicit the cost of the slowly growing function c(d) (§4.4.3).


Exercise 4.15 Prove the remainder term (4.28) in the expansion (4.27) for $E_1(x)$. [Hint: prove the result by induction on k, using integration by parts in the formula (4.28).]


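Exercise 4.16 Show that we can avoid using Cauchy principal value integrals by defining Ei(z) and $E_1(z)$ in terms of the entire function

$$\mathrm{Ein}(z) = \int_0^z \frac{1 - \exp(-t)}{t}\,dt = \sum_{j=1}^{\infty} \frac{(-1)^{j-1} z^j}{j!\,j}.$$


Exercise 4.17 Let $E_1(x)$ be defined by (4.25) for real x > 0. Using (4.27), show that

$$\frac{1}{x} - \frac{1}{x^2} < e^x E_1(x) < \frac{1}{x}.$$

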
Exercise 4.18 In this exercise, the series are purely formal, so ignore any questions of convergence. Applications are given in Exercises 4.19-4.20.

Suppose that $(a_j)_{j \in N}$ is a sequence with exponential generating function $s(z) = \sum_{j=0}^{\infty} a_j z^j/j!$. Suppose that $A_n = \sum_{j=0}^{n} \binom{n}{j} a_j$, and let $S(z) = \sum_{j=0}^{\infty} A_j z^j/j!$ be the exponential generating function of the sequence $(A_n)_{n \in N}$. Show that

$$S(z) = \exp(z)\,s(z).$$


Exercise 4.19 The power series for Ein(z) given in Exercise 4.16 suffers from catastrophic cancellation when z is large and positive (like the series for exp(-z)). Use Exercise 4.18 to show that this problem can be avoided by using the power series (where $H_n$ denotes the nth harmonic number)

$$e^z\,\mathrm{Ein}(z) = \sum_{j=1}^{\infty} \frac{H_j z^j}{j!}.$$


Exercise 4.20 Show that Eqn. (4.23) for erf(x) follows from Eqn. (4.22). [Hint: this is similar to Exercise 4.19.]


Exercise 4.21 Give an algorithm to evaluate $\Gamma(x)$ for real $x \ge 1/2$, with guaranteed relative error $O(2^{-n})$. Use the method sketched in §4.5 for $\ln \Gamma(x)$. What can be said about the complexity of the algorithm?


Exercise 4.22 Extend your solution to Exercise 4.21 to give an algorithm to evaluate $1/\Gamma(z)$ for $z \in C$, with guaranteed relative error $O(2^{-n})$. Note: $\Gamma(z)$ has poles at zero and the negative integers (i.e. for $-z \in N$), but we overcome this difficulty by computing the entire function $1/\Gamma(z)$. Warning: $|\Gamma(z)|$ can be very small if $\Im(z)$ is large. This follows from Stirling's asymptotic expansion. In the particular case of z = iy on the imaginary axis, we have

$$2 \ln |\Gamma(iy)| = \ln\left(\frac{\pi}{y \sinh(\pi y)}\right) \approx -\pi |y|.$$

More generally,

$$|\Gamma(x + iy)|^2 \approx 2\pi |y|^{2x-1} \exp(-\pi |y|)$$

for $x, y \in R$ and |y| large.


Exercise 4.23 The usual form (4.38) of Stirling's approximation for $\ln(\Gamma(z))$ involves a divergent series. It is possible to give a version of Stirling's approximation where the series is convergent:

$$\ln \Gamma(z) = \left(z - \frac{1}{2}\right) \ln z - z + \frac{\ln(2\pi)}{2} + \sum_{k=1}^{\infty} \frac{c_k}{(z+1)(z+2)\cdots(z+k)}, \qquad (4.88)$$

where the constants $c_k$ can be expressed in terms of Stirling numbers of the first kind, s(n, k), defined by the generating function

$$\sum_{k=0}^{n} s(n, k) x^k = x(x-1)\cdots(x-n+1).$$


In fact,

$$c_k = \frac{1}{2k} \sum_{j=1}^{k} \frac{j\,|s(k, j)|}{(j+1)(j+2)}.$$


The Stirling numbers s(n, k) can be computed easily from a three-term recurrence, so this gives a feasible alternative to the usual form of Stirling's approximation with coefficients related to Bernoulli numbers.

Show, experimentally and/or theoretically, that the convergent form of Stirling's approximation is not an improvement over the usual form as used in Exercise 4.21.


Exercise 4.24 Implement procedures to evaluate $E_1(x)$ to high precision for real positive x, using (a) the power series (4.26), (b) the asymptotic expansion (4.27) (if sufficiently accurate), (c) the method of Exercise 4.19, and (d) the continued fraction (4.39) using the backward and forward recurrences as suggested in §4.6. Determine empirically the regions where each method is the fastest.


Exercise 4.25 Prove the backward recurrence (4.43). 


Exercise 4.26 Prove the forward recurrence (4.44). 


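[Hint: let

$$y_k(x) = \frac{a_1}{b_1 +}\;\frac{a_2}{b_2 +}\;\cdots\;\frac{a_{k-1}}{b_{k-1} +}\;\frac{a_k}{b_k + x}.$$

Show, by induction on $k \ge 1$, that

$$y_k(x) = \frac{P_k + P_{k-1} x}{Q_k + Q_{k-1} x}.]$$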

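Exercise 4.27 For the forward recurrence (4.44), show that

$$\begin{pmatrix} Q_k & Q_{k-1} \\ P_k & P_{k-1} \end{pmatrix} = \begin{pmatrix} b_1 & 1 \\ a_1 & 0 \end{pmatrix} \begin{pmatrix} b_2 & 1 \\ a_2 & 0 \end{pmatrix} \cdots \begin{pmatrix} b_k & 1 \\ a_k & 0 \end{pmatrix}$$

holds for k > 0 (and for k = 0 if we define $P_{-1}$, $Q_{-1}$ appropriately).

Remark. This gives a way to use parallelism when evaluating continued fractions.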


Exercise 4.28 For the forward recurrence (4.44), show that

$$\begin{vmatrix} Q_k & Q_{k-1} \\ P_k & P_{k-1} \end{vmatrix} = (-1)^k a_1 a_2 \cdots a_k.$$


Exercise 4.29 Prove the identity (4.46).


Exercise 4.30 Prove Theorem 4.1.


Exercise 4.31 Investigate using the continued fraction (4.40) for evaluating the complementary error function erfc(x) or the error function erf(x) = 1 - erfc(x). Is there a region where the continued fraction is preferable to any of the methods used in Algorithm Erf of §4.6?


Exercise 4.32 Show that the continued fraction (4.41) can be evaluated in time O(M(k) log k) if the $a_j$ and $b_j$ are bounded integers (or rational numbers with bounded numerators and denominators). [Hint: use Exercise 4.27.]


Exercise 4.33 Instead of (4.54), a different normalization condition

$$J_0(x)^2 + 2 \sum_{\nu=1}^{\infty} J_\nu(x)^2 = 1 \qquad (4.89)$$

could be used in Miller's algorithm. Which of these normalization conditions is preferable?


Exercise 4.34 Consider the recurrence $f_{\nu-1} + f_{\nu+1} = 2K f_\nu$, where K > 0 is a fixed real constant. We can expect the solution to this recurrence to give some insight into the behavior of the recurrence (4.53) in the region $\nu \approx Kx$. Assume for simplicity that $K \ne 1$. Show that the general solution has the form

$$f_\nu = A\lambda^\nu + B\mu^\nu,$$

where $\lambda$ and $\mu$ are the roots of the quadratic equation $x^2 - 2Kx + 1 = 0$, and A and B are constants determined by the initial conditions. Show that there are two cases: if K < 1, then $\lambda$ and $\mu$ are complex conjugates on the unit circle, so $|\lambda| = |\mu| = 1$; if K > 1, then there are two real roots satisfying $\lambda\mu = 1$.


Exercise 4.35 Prove (or give a plausibility argument for) the statements made in §4.7 that: (a) if a recurrence based on (4.59) is used to evaluate the scaled Bernoulli number $C_k$, using precision n arithmetic, then the relative error is of order $4^k 2^{-n}$; and (b) if a recurrence based on (4.60) is used, then the relative error is $O(k^2 2^{-n})$.


Exercise 4.36 Starting from the definition (4.56), prove Eqn. (4.57). Deduce 
the relation (4.62) connecting tangent numbers and Bernoulli numbers. 


Exercise 4.37 (a) Show that the number of bits required to represent the tangent number $T_k$ exactly is $\sim 2k \lg k$ as $k \to \infty$. (b) Show that the same applies to the exact representation of the Bernoulli number $B_{2k}$ as a rational number.


Exercise 4.38 Explain how the correctness of Algorithm TangentNumbers (§4.7.2) follows from the recurrence (4.63).


Algorithm 4.5 SecantNumbers
Input: positive integer m
Output: Secant numbers S_0, S_1, ..., S_m
S_0 ← 1
for k from 1 to m do
    S_k ← k S_{k-1}
for k from 1 to m do
    for j from k + 1 to m do
        S_j ← (j - k) S_{j-1} + (j - k + 1) S_j
return S_0, S_1, ..., S_m.
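
Algorithm SecantNumbers translates almost line for line into Python. The sketch below is a direct transcription using Python's arbitrary-precision integers; the function name `secant_numbers` is a hypothetical choice, and the list `S` plays the role of the in-place array.

```python
def secant_numbers(m):
    """Return [S_0, S_1, ..., S_m] following Algorithm 4.5 (in place)."""
    S = [1] * (m + 1)
    for k in range(1, m + 1):
        S[k] = k * S[k - 1]  # S_k <- k * S_{k-1}
    for k in range(1, m + 1):
        for j in range(k + 1, m + 1):
            S[j] = (j - k) * S[j - 1] + (j - k + 1) * S[j]
    return S
```

For m = 5 this gives 1, 1, 5, 61, 1385, 50521, matching the first secant numbers.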


Exercise 4.39 Show that the complexity of computing the tangent numbers $T_1, \ldots, T_m$ by Algorithm TangentNumbers (§4.7.2) is $O(m^3 \log m)$. Assume that the multiplications of tangent numbers $T_j$ by small integers take time $O(\log T_j)$. [Hint: use the result of Exercise 4.37.]


Exercise 4.40 Verify that Algorithm SecantNumbers computes in-place the Secant numbers $S_k$, defined by the generating function

$$\sum_{k \ge 0} S_k \frac{x^{2k}}{(2k)!} = \sec x = \frac{1}{\cos x},$$

in much the same way that Algorithm TangentNumbers (§4.7.2) computes the Tangent numbers.


Exercise 4.41 (Harvey) The generating function (4.56) for Bernoulli numbers can be written as

$$\sum_{k \ge 0} B_k \frac{x^k}{k!} = 1 \Big/ \sum_{k \ge 0} \frac{x^k}{(k+1)!},$$

and we can use an asymptotically fast algorithm to compute the first n + 1 terms in the reciprocal of the power series. This should be asymptotically faster than using the recurrences given in §4.7.2. Give an algorithm using this idea to compute the Bernoulli numbers $B_0, B_1, \ldots, B_n$ in time $O(n^2 (\log n)^{2+\varepsilon})$. Implement your algorithm and see how large n needs to be for it to be faster than the algorithms discussed in §4.7.2.


Algorithm 4.6 SeriesExponential
Input: positive integer m and real numbers a_1, a_2, ..., a_m
Output: real numbers b_0, b_1, ..., b_m such that
    b_0 + b_1 x + ... + b_m x^m = exp(a_1 x + ... + a_m x^m) + O(x^{m+1})
b_0 ← 1
for k from 1 to m do
    b_k ← (sum_{j=1}^{k} j a_j b_{k-j}) / k
return b_0, b_1, ..., b_m.
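
Algorithm SeriesExponential can be transcribed directly. The Python sketch below uses exact rationals (`fractions.Fraction`) instead of real numbers so that the result can be checked exactly; that choice and the function name `series_exponential` are assumptions of this sketch.

```python
from fractions import Fraction


def series_exponential(a):
    """Given a = [a_1, ..., a_m], return [b_0, ..., b_m] with
    b_0 + b_1 x + ... + b_m x^m = exp(a_1 x + ... + a_m x^m) + O(x^(m+1))."""
    m = len(a)
    b = [Fraction(1)] + [Fraction(0)] * m
    for k in range(1, m + 1):
        # a[j - 1] holds a_j because Python lists are 0-based
        b[k] = sum(j * a[j - 1] * b[k - j] for j in range(1, k + 1)) / k
    return b
```

Two quick checks: with A(x) = x the output is 1/k!, and with A(x) = x + x^2/2 + ... + x^m/m, a truncation of -ln(1 - x), every output coefficient equals 1, as expected from exp(-ln(1 - x)) = 1/(1 - x).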


Exercise 4.42 (a) Show that Algorithm SeriesExponential computes B(x) = exp(A(x)) up to terms of order $x^{m+1}$, where $A(x) = a_1 x + a_2 x^2 + \cdots + a_m x^m$ is input data and $B(x) = b_0 + b_1 x + \cdots + b_m x^m$ is the output. [Hint: compare Exercise 4.1.]

(b) Apply this to give an algorithm to compute the coefficients $b_k$ in Stirling's approximation for n! (or $\Gamma(n + 1)$):

$$n! \sim \left(\frac{n}{e}\right)^n \sqrt{2\pi n}\, \sum_{k \ge 0} \frac{b_k}{n^k}.$$

[Hint: we know the coefficients in Stirling's approximation (4.38) for $\ln \Gamma(z)$ in terms of Bernoulli numbers.]

(c) Is this likely to be useful for high-precision computation of $\Gamma(x)$ for real positive x?


Exercise 4.43 Deduce from Eqn. (4.69) and (4.70) an expansion of ln(4/k) with error term $O(k^4 \log(4/k))$. Use any means to figure out an effective bound on the O() term. Deduce an algorithm requiring only $x \ge 2^{n/4}$ to get n bits of ln x.


Exercise 4.44 Show how both $\pi$ and ln 2 can be evaluated using Eqn. (4.71).


Exercise 4.45 In §4.8.4, we mentioned that $2\theta_2\theta_3$ and $\theta_2^2 + \theta_3^2$ can be computed using two non-scalar multiplications. For example, we could (A) compute $u = (\theta_2 + \theta_3)^2$ and $v = \theta_2\theta_3$; then the desired values are 2v and u - 2v. Alternatively, we could (B) compute u and $w = (\theta_2 - \theta_3)^2$; then the desired values are $(u \pm w)/2$. Which method (A) or (B) is preferable?


Exercise 4.46 Improve the constants in Table 4.1.


Exercise 4.47 Justify Eqn. (4.80) and give an upper bound on the constant $\alpha$ if the multiplication algorithm satisfies $M(n) = \Theta(n^c)$ for some $c \in (1, 2]$.


Exercise 4.48 (Salvy) Is the function $\exp(x^2) + x/(1 - x^2)$ holonomic?


Exercise 4.49 (van der Hoeven, Mezzarobba) Improve to $O(M(n) \log^2 n)$ the complexity given in Theorem 4.2.


Exercise 4.50 If $w = e^{2\pi i/k}$, show that

$$\frac{1}{z^k - 1} = \frac{1}{k} \sum_{m=0}^{k-1} \frac{w^m}{z - w^m}.$$

Deduce that $S_{j,k}$, defined by Eqn. (4.84), satisfies

$$S_{j,k} = \frac{1}{2\pi i} \int_{C'} \frac{z^{k-j-1}}{z^k - 1} f(z)\,dz$$

for j < k, where the contour C' is as in §4.10. Deduce Eqn. (4.85).

Remark. Eqn. (4.85) illustrates the phenomenon of aliasing: observations at k points can not distinguish between the Fourier coefficients $a_j$, $a_{j+k}$, $a_{j+2k}$, etc.


Exercise 4.51 Show that the sum $S_{2j,k}$ of §4.10 can be computed with (essentially) only about k/4 evaluations of f if k is even. Similarly, show that about k/2 evaluations of f suffice if k is odd. On the other hand, show that the error bound $O((2\pi)^{-k})$ following Eqn. (4.87) can be improved if k is odd.


4.12 Notes and references 


One of the main references for special functions is the "Handbook of Mathematical Functions" by Abramowitz and Stegun [1], which gives many useful results but no proofs. A more recent book is that of Nico Temme [214], and a comprehensive reference is Andrews et al. [4]. A large part of the content of this chapter comes from Brent [48], and was implemented in the MP package Brent [47]. In the context of floating-point computations, the "Handbook of Floating-Point Arithmetic" by Brisebarre et al. [58] is a useful reference, especially Chapter 11.

The SRT algorithm for division is named after Sweeney, Robertson [189] and Tocher [216]. Original papers on Booth recoding, SRT division, etc., are reprinted in the book by Swartzlander [212]. SRT division is similar to non-restoring division, but uses a lookup table based on the dividend and the divisor to determine each quotient digit. The Intel Pentium fdiv bug was caused by an incorrectly initialized lookup table.

Basic material on Newton's method may be found in many references, for example the books by Brent [41, Ch. 3], Householder [126] or Traub [218]. Some details on the use of Newton's method in modern processors can be found in Intel [128]. The idea of first computing $y^{-1/2}$, then multiplying by y to get $y^{1/2}$ (§4.2.3) was pushed further by Karp and Markstein [137], who perform this at the penultimate iteration, and modify the last iteration of Newton's method for $y^{-1/2}$ to get $y^{1/2}$ directly (see §1.4.5 for an example of the Karp–Markstein trick for division). For more on Newton's method for power series, we refer to […3, 52, 56, 142, 202].

Some good references on error analysis of floating-point algorithms are the books by Higham [121] and Muller [174]. Older references include Wilkinson's classics [228, 229].


Regarding doubling versus tripling: in §4.3.4, we assumed that one multiplication and one squaring were required to apply the tripling formula (4.19). However, we might use the form $\sinh(3x) = 3\sinh(x) + 4\sinh^3(x)$, which requires only one cubing. Assuming a cubing costs 50% more than a squaring (in the FFT range), the ratio would be $1.5 \log_3 2 \approx 0.946$. Thus, if a specialized cubing routine is available, tripling may sometimes be slightly faster than doubling.

For an example of a detailed error analysis of an unrestricted algorithm, see Clenshaw and Olver [69].

The idea of rectangular series splitting to evaluate a power series with $O(\sqrt{n})$ non-scalar multiplications (§4.4.3) was first published in 1973 by Paterson and Stockmeyer [182]. It was rediscovered in the context of multiple-precision evaluation of elementary functions by Smith [204, §8.7] in 1991. Smith gave it the name "concurrent series". Smith proposed modular splitting of the series, but classical splitting seems slightly better. Smith noticed that the simultaneous use of this fast technique and argument reduction yields $O(n^{1/3} M(n))$ algorithms. Earlier, in 1960, Estrin [92] had found a similar technique with n/2 non-scalar multiplications, but O(log n) parallel complexity.

There are several variants of the Euler–Maclaurin sum formula, with and without bounds on the remainder. See Abramowitz and Stegun [1, Ch. 23], and Apostol [6], for example.

Most of the asymptotic expansions that we have given in §4.5 may be found in Abramowitz and Stegun [1]. For more background on asymptotic expansions of special functions, see for example the books by de Bruijn [84], Olver [180] and Wong [231]. We have omitted mention of many other useful asymptotic expansions, for example all but a few of those for Bessel functions, for which see Olver [180], Watson [225], Whittaker and Watson [227].

Most of the continued fractions mentioned in §4.6 may be found in Abramowitz and Stegun [1]. The classical theory is given in the books by Khinchin [139] and Wall [224]. Continued fractions are used in the manner described in §4.6 in arbitrary-precision packages such as Brent's MP [47]. A good recent reference on various aspects of continued fractions for the evaluation of special functions is the Handbook of Continued Fractions for Special Functions by Cuyt et al. In particular, Chapter 7 contains a discussion of error bounds. Our Theorem 4.1 is a trivial modification of Cuyt et al. [Theorem 7.5.1]. The asymptotically fast algorithm suggested in Exercise 4.32 was given by Schönhage [195].

A proof of a generalization of (4.54) is given in [4, …]. Miller's algorithm is due to J. C. P. Miller. It is described, for example, in [1, …] and Clenshaw et al. [68, §13.14]. An algorithm is given in Gautschi [102].

A recurrence based on (4.60) was used to evaluate the scaled Bernoulli numbers $C_k$ in the MP package following a suggestion of Reinsch [48, §12]. Previously, the inferior recurrence (4.59) was widely used, for example in Knuth [140] and in early versions of Brent's MP package [47, §6.11]. The idea of using tangent numbers is mentioned in [107, §6.5], where it is attributed to B. F. Logan. Our in-place Algorithms TangentNumbers and SecantNumbers may be new (see Exercises 4.38-4.40). Kaneko [135] describes an algorithm of Akiyama and Tanigawa for computing Bernoulli numbers in a manner similar to "Pascal's triangle". However, it requires more arithmetic operations than Algorithm TangentNumbers. Also, the Akiyama–Tanigawa algorithm is only recommended for exact rational arithmetic, since it is numerically unstable if implemented in floating-point arithmetic. For more on Bernoulli, tangent and secant numbers, and a connection with Stirling numbers, see Chen [62] and Sloane [203, A027641, A000182, A000364].

The Staudt–Clausen theorem was proved independently by Karl von Staudt and Thomas Clausen in 1840. It can be found in many references. If just a single Bernoulli number of large index is required, then Harvey's algorithm [11…] can be recommended.

Some references on the Arithmetic-Geometric Mean (AGM) are Brent [43, …1], Salamin [192], the Borweins' book [36], Arndt and Haenel [7]. An early reference, which includes some results that were rediscovered later, is the fascinating report HAKMEM by Beeler, Gosper and Schroeppel [15]. Bernstein [19] gives a survey of different AGM algorithms for computing the logarithm. Eqn. (4.70) is given in Borwein and Borwein [36, (1.3.10)], and the bound (4.73) is given in [36, p. 11, Exercise 4(c)]. The AGM can be extended to complex starting values provided we take the correct branch of the square root (the one with positive real part): see Borwein and Borwein [36, pp. 15-16]. The use of the complex AGM is discussed in [88]. For theta function identities, see [36, Chapter 2], and for a proof of (4.78), see [36, §2.3].

The use of the exact formula (4.79) to compute ln x was first suggested by Sasaki and Kanada (see [36, (7.2.5)], but beware the typo). See Brent [46] for Landen transformations, and Brent [43] for more efficient methods; note that the constants given in those papers might be improved using faster square root algorithms (Chapter 3).

The constants in Table 4.1 are justified as follows. We assume we are in the FFT domain, and one Fourier transform costs M(n)/3. The $13M(n)/9 \approx 1.444M(n)$ cost for a real reciprocal is from Harvey [116], and assumes $M(n) \sim 3T(2n)$, where T(n) is the time to perform a Fourier transform of size n. For the complex reciprocal $1/(v + iw) = (v - iw)/(v^2 + w^2)$, we compute $v^2 + w^2$ using two forward transforms and one backward transform, equivalent in cost to M(n), then one real reciprocal to obtain say $x = 1/(v^2 + w^2)$, then two real multiplications to compute vx, wx, but take advantage of the fact that we already know the forward transforms of v and w, and the transform of x only needs to be computed once, so these two multiplications cost only M(n). Thus, the total cost is $31M(n)/9 \approx 3.444M(n)$. The 1.666M(n) cost for real division is from van der Hoeven [125, Remark 6], and assumes $M(n) \sim 3T(2n)$ as above for the real reciprocal. For complex division, say (t + iu)/(v + iw), we first compute the complex reciprocal x + iy = 1/(v + iw), then perform a complex multiplication (t + iu)(x + iy), but save the cost of two transforms by observing that the transforms of x and y are known as a byproduct of the complex reciprocal algorithm. Thus, the total cost is $(31/9 + 4/3)M(n) \approx 4.777M(n)$. The 4M(n)/3 cost for the real square root is from Harvey [116], and assumes $M(n) \sim 3T(2n)$ as above. The complex square root uses Friedland's algorithm [97]: $\sqrt{x + iy} = w + iy/(2w)$, where $w = \sqrt{(|x| + (x^2 + y^2)^{1/2})/2}$; as for the complex reciprocal, $x^2 + y^2$ costs M(n), then we compute its square root in 4M(n)/3, the second square root in 4M(n)/3, and the division y/w costs 1.666M(n), which gives a total of 5.333M(n).
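
The formula quoted above for the complex square root can be illustrated in double precision. The sketch below is not taken from [97]: the branch for x < 0, where the real and imaginary parts exchange roles, and the special case x = y = 0 are additions made here so that the function returns the principal square root for every input. The name `friedland_sqrt` is hypothetical.

```python
import cmath
import math


def friedland_sqrt(x, y):
    """Principal square root of x + i*y, returned as a pair (re, im)."""
    if x == 0 and y == 0:
        return 0.0, 0.0
    w = math.sqrt((abs(x) + math.hypot(x, y)) / 2)
    if x >= 0:
        return w, y / (2 * w)  # the case covered by the formula in the text
    # For x < 0 the two parts swap roles (an addition of this sketch).
    return abs(y) / (2 * w), math.copysign(w, y)
```

As in the cost analysis, the work consists of one evaluation of $x^2 + y^2$ (hidden in `math.hypot`), two square roots and one division.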

The cost of one real AGM iteration is at most the sum of the multiplication cost and of the square root cost, but since we typically perform several iterations, it is reasonable to assume that the input and output of the iteration includes the transforms of the operands. The transform of a + b is obtained by linearity from the transforms of a and b, so is essentially free. Thus, we save one transform or M(n)/3 per iteration, giving a cost per iteration of 2M(n). (Another way to save M(n)/3 is to trade the multiplication for a squaring, as explained in Schönhage, Grotefeld, and Vetter [198, §8.2.5].) The complex AGM is analogous: it costs the same as a complex multiplication (2M(n)) and a complex square root (5.333M(n)), but we can save two (real) transforms per iteration (2M(n)/3), giving a net cost of 6.666M(n). Finally, the logarithm via the AGM costs 2 lg(n) + O(1) AGM iterations.
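
The bookkeeping in the last two paragraphs can be re-derived mechanically. The following Python sketch expresses every cost as an exact multiple of M(n), reading the decimals 1.444, 1.666 and 1.333 as 13/9, 5/3 and 4/3 (an assumption of this sketch, consistent with the fractions given in the text). The variable names are hypothetical.

```python
from fractions import Fraction as F

M = F(1)      # unit of cost: one n-bit multiplication, M(n)
T = M / 3     # one Fourier transform, assuming M(n) ~ 3T(2n)

real_reciprocal = F(13, 9)
real_division = F(5, 3)
real_sqrt = F(4, 3)

# v^2 + w^2, one real reciprocal, then two products sharing transforms.
complex_reciprocal = M + real_reciprocal + M
# Complex multiplication (2M) minus two transforms already known.
complex_division = complex_reciprocal + (2 * M - 2 * T)
# x^2 + y^2, two real square roots, one real division.
complex_sqrt = M + real_sqrt + real_sqrt + real_division

real_agm = M + real_sqrt - T                  # one transform saved
complex_agm = 2 * M + complex_sqrt - 2 * T    # two transforms saved
```

Evaluating these expressions gives 31/9, 43/9, 16/3, 2 and 20/3, which are the values 3.444, 4.777, 5.333, 2 and 6.666 quoted above.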

We note that some of the constants in Table 4.1 may not be optimal. For example, it may be possible to reduce the cost of reciprocal or square root (Harvey, Sergeev). We leave this as a challenge to the reader (see Exercise 4.46). Note that the constants for operations on power series may differ from the corresponding constants for operations on integers/reals.


The idea of binary splitting is quite old, since in 1976 Brent [45, Th. 6.2] gave a binary splitting algorithm to compute exp x in time $O(M(n)(\log n)^2)$. See also Borwein and Borwein [36, page 335]. The CLN library implements several functions with binary splitting, see Haible and Papanikolaou [108], and is quite efficient for precisions of a million bits or more.
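
To make the idea concrete, here is a minimal Python sketch of binary splitting for the simplest possible case, the series for e = exp(1). It is not the algorithm of [45] or the CLN implementation: it only shows how a sum of rational terms is reduced to a balanced tree of integer products. The function names are hypothetical.

```python
from fractions import Fraction
import math


def split(a, b):
    """Return integers (P, Q) with
    P/Q = sum_{k=a+1}^{b} 1/((a+1)(a+2)...k), for 0 <= a < b."""
    if b - a == 1:
        return 1, b
    m = (a + b) // 2
    p1, q1 = split(a, m)
    p2, q2 = split(m, b)
    # Left half, plus right half scaled by 1/q1 = 1/((a+1)...m).
    return p1 * q2 + p2, q1 * q2


def e_approx(n_terms):
    """1 + 1/1! + ... + 1/n_terms! as an exact fraction."""
    p, q = split(0, n_terms)
    return 1 + Fraction(p, q)
```

The recursion multiplies integers of comparable size at each level of the tree, which is what allows fast multiplication to pay off for large n_terms.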

The "bit-burst" algorithm was invented by David and Gregory Chudnovsky [65], and our Theorem 4.2 is based on their work. Some references on holonomic functions are J. Bernstein [25, 26], van der Hoeven [123] and Zeilberger [233]. See also the Maple GFUN package [193], which allows one, amongst other things, to deduce the recurrence for the Taylor coefficients of f(x) from its differential equation.

There are several topics that are not covered in this chapter, but might have been if we had more time and space. We mention some references here. A useful resource is the website [143].

The Riemann zeta-function $\zeta(s)$ can be evaluated by the Euler–Maclaurin expansion (4.34)-(4.36), or by Borwein's algorithm [38, 39], but neither of these methods is efficient if $\Im(s)$ is large. On the critical line $\Re(s) = 1/2$, the Riemann–Siegel formula [99] is much faster and in practice sufficiently accurate, although only an asymptotic expansion. If enough terms are taken, the error seems to be $O(\exp(-\pi t))$, where $t = \Im(s)$: see Brent's review […] and Berry's paper [28]. An error analysis is given in [184]. The Riemann–Siegel coefficients may be defined by a recurrence in terms of certain integers $\rho_n$ that can be defined using Euler numbers (see Sloane's sequence A087617 [203]). Sloane calls this the Gabcke sequence but Gabcke credits Lehmer [155] so perhaps it should be called the Lehmer–Gabcke sequence. The sequence $(\rho_n)$ occurs naturally in the asymptotic expansion of $\ln(\Gamma(1/4 + it/2))$. The (not obvious) fact that the $\rho_n$ are integers was proved by de Reyna […].

Borwein's algorithm for $\zeta(s)$ can be generalized to cover functions such as the polylogarithm and the Hurwitz zeta-function: see Vepštas [223].

To evaluate the Riemann zeta-function $\zeta(\sigma + it)$ for fixed $\sigma$ and many equally spaced points t, the fastest known algorithm is due to Odlyzko and Schönhage [179]. It has been used by Odlyzko to compute blocks of zeros with very large height t, see [177, 178]; also (with improvements) by Gourdon to verify the Riemann Hypothesis for the first $10^{13}$ non-trivial zeros in the upper half-plane, see [105]. The Odlyzko–Schönhage algorithm can be generalized for the computation of other L-functions.

In §4.10, we briefly discussed the numerical approximation of contour integrals, but we omitted any discussion of other forms of numerical quadrature, for example Romberg quadrature, the tanh rule, the tanh-sinh rule, etc. Some references are [11, 12, 13, 95, 172, 213], and […]. For further discussion of the contour integration method, see […]. For Romberg quadrature (which depends on Richardson extrapolation), see […9, 188, 191]. For Clenshaw–Curtis and Gaussian quadrature, see [67, 219]. An example of the use of numerical quadrature to evaluate $\Gamma(x)$ is [32, p. 188]. This is an interesting alternative to the use of Stirling's asymptotic expansion (§4.5).

We have not discussed the computation of specific mathematical constants such as $\pi$, $\gamma$ (Euler's constant), $\zeta(3)$, etc. $\pi$ can be evaluated using $\pi = 4\arctan(1)$ and a fast arctan computation (§4.9.2); or by the Gauss–Legendre algorithm (also known as the Brent–Salamin algorithm), see [43, 46, 192]. This asymptotically fast algorithm is based on the arithmetic-geometric mean and Legendre's relation (4.72). A recent record computation by Bellard [16] used a rapidly converging series for $1/\pi$ by the Chudnovsky brothers [64], combined with binary splitting. Its complexity is $O(M(n) \log^2 n)$ (theoretically worse than Gauss–Legendre's $O(M(n) \log n)$, but with a small constant factor). There are several popular books on $\pi$: we mention Arndt and Haenel [7]. A more advanced book is the one by the Borwein brothers [36].
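
For readers who want to experiment, the Gauss–Legendre (Brent–Salamin) iteration can be sketched in a few lines using Python's `decimal` module. This is the commonly quoted textbook form of the iteration, written here as an illustration under stated assumptions: the number of guard digits and the iteration count are choices of this sketch, not tuned values from the references.

```python
from decimal import Decimal, localcontext


def gauss_legendre_pi(digits):
    """Approximate pi to about `digits` decimal digits."""
    with localcontext() as ctx:
        ctx.prec = digits + 10  # a few guard digits (assumption)
        a = Decimal(1)
        b = 1 / Decimal(2).sqrt()
        t = Decimal(1) / 4
        p = Decimal(1)
        # The number of correct digits roughly doubles per iteration.
        for _ in range(max(1, digits.bit_length())):
            a_next = (a + b) / 2
            b = (a * b).sqrt()
            t -= p * (a - a_next) ** 2
            a = a_next
            p *= 2
        return (a + b) ** 2 / (4 * t)
```

Each iteration costs one multiplication, one squaring and one square root at full precision, which is the pattern behind the O(M(n) log n) bound mentioned above.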

For a clever implementation of binary splitting and its application to the
fast computation of constants such as π and ζ(3) (and more generally constants
defined by hypergeometric series), see Cheng, Hanrot, Thomé, Zima,
and Zimmermann [63].
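
To give the flavour of binary splitting on the simplest hypergeometric series, the following Python sketch evaluates e = Σ 1/k! with exact integer arithmetic. It is not the implementation of [63]; the names `split` and `e_digits` and the five guard digits are assumptions made for this example.

```python
import math

def split(a, b):
    """Return (P, Q) with P/Q = sum_{k=a+1}^{b} a!/k!, for 0 <= a < b."""
    if b - a == 1:
        return 1, b
    m = (a + b) // 2
    p1, q1 = split(a, m)
    p2, q2 = split(m, b)
    # sum over (a, b] = sum over (a, m] + (a!/m!) * sum over (m, b]
    return p1 * q2 + p2, q1 * q2

def e_digits(digits, guard=5):
    """Return e * 10**digits truncated to an integer (illustrative sketch)."""
    # Choose the number of terms n so that n! exceeds 10**(digits + guard).
    n = 2
    while math.lgamma(n + 1) < (digits + guard) * math.log(10):
        n += 1
    p, q = split(0, n)
    # e = 1 + P(0, n)/Q(0, n) up to the truncation error of the series.
    return 10 ** digits + (p * 10 ** digits) // q

# Example use: e_digits(20) gives e scaled by 10**20.
```

The point of the recursion is that the large multiplications occur near the root of the tree, where the operands are balanced and fast multiplication pays off.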

The computation of γ and its continued fraction is of interest because it
is not known whether γ is rational (though this is unlikely). The best algorithm
for computing γ appears to be the “Bessel function” algorithm of Brent
and McMillan [54], as modified by Papanikolaou and later Gourdon [106] to
incorporate binary splitting. A very useful source of information on the evaluation
of constants (including π, e, γ, ln 2, ζ(3)) and certain functions (including
Γ(z) and ζ(s)) is Gourdon and Sebah's web site [106].

A nice book on accurate numerical computations for a diverse set of “SIAM
100-Digit Challenge” problems is Bornemann, Laurie, Wagon, and Waldvo- 
gel [32]. In particular, Appendix B of this book considers how to solve the 
problems to 10 000-decimal digit accuracy (and succeeds in all cases but one). 

Implementations and pointers


Here we present a non-exhaustive list of software packages that 
(in most cases) the authors have tried, together with some other 
useful pointers. Of course, we cannot accept any responsibility 
for bugs/errors/omissions in any of the software or documenta- 
tion mentioned here — caveat emptor! 


Websites change. If any of the websites mentioned here disappear
in the future, you may be able to find the new site using a search
engine with appropriate keywords.


5.1 Software tools 


5.1.1 CLN 


CLN (Class Library for Numbers, http://www.ginac.de/CLN/) is a
library for efficient computations with all kinds of numbers in arbitrary precision.
It was written by Bruno Haible, and is currently maintained by Richard
Kreckel. It is written in C++ and distributed under the GNU General Public
License (GPL). CLN provides some elementary and special functions, and fast
arithmetic on large numbers; in particular, it implements Schönhage-Strassen
multiplication and the binary splitting algorithm [108]. CLN can be configured
to use GMP low-level MPN routines, which improves its performance.


5.1.2 GNU MP (GMP) 


The GNU MP library is the main reference for arbitrary-precision arithmetic. 
It has been developed since 1991 by Torbjörn Granlund and several other contributors.
GNU MP (GMP for short) implements several of the algorithms described
in this book. In particular, we recommend reading the “Algorithms”
chapter of the GMP reference manual [104]. GMP is written in C, is released
under the GNU Lesser General Public License (LGPL), and is available from
http://gmplib.org/.

GMP's MPZ class implements arbitrary-precision integers (corresponding
to Chapter 1), while the MPF class implements arbitrary-precision floating-point
numbers (corresponding to Chapter 3).¹ The performance of GMP comes
mostly from its low-level MPN class, which is well designed and highly optimized
in assembly code for many architectures.

As of version 5.0.0, GMP implements different multiplication algorithms
(schoolbook, Karatsuba, Toom-Cook 3-way, 4-way, 6-way, 8-way, and FFT
using Schönhage-Strassen's algorithm); its division routine implements Algorithm
RecursiveDivRem (§1.4.3) in the middle range, and beyond that Newton's
method, with complexity O(M(n)), and so does its square root, which
implements Algorithm SqrtRem, since it relies on division. The Newton division
first precomputes a reciprocal to precision n/2, and then performs two
steps of Barrett reduction to precision n/2: this is an integer variant of Algorithm
Divide. It also implements unbalanced multiplication, with Toom-Cook
(3,2), (4,3), (5,3), (4,2), or (6,3) [31]. Function mpn_ni_invertappr,
which is not in the public interface, implements Algorithm ApproximateReciprocal
(§3.4.1). GMP 5.0.0 does not implement elementary or special
functions (Chapter 4), nor does it provide modular arithmetic with an invariant
divisor in its public interface (Chapter 2). However, it contains a preliminary
interface for Montgomery's REDC algorithm.
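
To make the first step beyond schoolbook multiplication concrete, here is a minimal Python sketch of Karatsuba's method on non-negative integers. It is not GMP's code (GMP works on arrays of machine words in C and assembly); the function name `karatsuba` and the `threshold` parameter are hypothetical, and Python's built-in product stands in for the schoolbook base case.

```python
def karatsuba(x, y, threshold=64):
    """Multiply non-negative integers x and y by Karatsuba's method (sketch)."""
    n = max(x.bit_length(), y.bit_length())
    if n <= threshold:
        return x * y                     # stand-in for the schoolbook base case
    h = n // 2
    mask = (1 << h) - 1
    x0, x1 = x & mask, x >> h            # x = x1 * 2^h + x0
    y0, y1 = y & mask, y >> h            # y = y1 * 2^h + y0
    lo = karatsuba(x0, y0, threshold)
    hi = karatsuba(x1, y1, threshold)
    # Three recursive products instead of four: the middle coefficient
    # x0*y1 + x1*y0 is recovered from (x0 + x1)*(y0 + y1) - lo - hi.
    mid = karatsuba(x0 + x1, y0 + y1, threshold) - lo - hi
    return (hi << (2 * h)) + (mid << h) + lo

# Example use: compare with the built-in product on large operands.
x, y = 3 ** 1000, 7 ** 900 + 12345
product = karatsuba(x, y)
```

In a real library the threshold between algorithms is tuned per architecture, which is one reason the low-level layer matters so much for performance.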

MPIR is a “fork” of GMP, with a different license, and various other differences
that make some functions more efficient with GMP, and some with
MPIR; also, the difficulty of compiling under Microsoft operating systems may
vary between the forks. Bear in mind that the developers of GMP and MPIR are
continually improving their code, so the situation is dynamic. For more on MPIR,
see http://www.mpir.org/.


5.1.3 MPFQ 


MPFQ is a software library developed by Pierrick Gaudry and Emmanuel 
Thomé for manipulation of finite fields. What makes MPFQ different from 
other modular arithmetic libraries is that the target finite field is given at com- 
pile time, thus more specific optimizations can be done. The two main targets 
of MPFQ are the Galois fields F_{2^n} and F_p, with p prime. MPFQ is available
from http://www.mpfq.org/, and is distributed under the GNU Lesser
General Public License (LGPL).

¹ However, the authors of GMP recommend using MPFR (see §5.1.4) for new projects.


5.1.4 GNU MPFR 


GNU MPFR is a multiple-precision binary floating-point library, written in C,
based on the GNU MP library, and distributed under the GNU Lesser General
Public License (LGPL). It extends the main ideas of the IEEE 754 standard to
arbitrary-precision arithmetic, by providing correct rounding and exceptions.
MPFR implements the algorithms of Chapter 3 and most of those of Chapter 4,
including all mathematical functions defined by the ISO C99 standard.
These strong semantics are in most cases achieved with no significant slowdown
compared to other arbitrary-precision tools. For details of the MPFR
library, see http://www.mpfr.org/ and the paper [96].


5.1.5 Other multiple-precision packages 


Without attempting to be exhaustive, we briefly mention some of MPFR’s pre- 
decessors, competitors, and extensions. 


1. ARPREC is a package for multiple-precision floating-point arithmetic, written
by David Bailey et al. in C++/Fortran. The distribution includes The Experimental
Mathematician's Toolkit, which is an interactive high-precision
arithmetic computing environment. ARPREC is available from
http://crd.lbl.gov/~dhbailey/mpdist/.


2. MP [47] is a package for multiple-precision floating-point arithmetic and elementary
and special function evaluation, written in Fortran77. MP permits
any small base β (subject to restrictions imposed by the word-size), and implements
several rounding modes, though correct rounding is not
guaranteed in all cases. MP is now obsolete, and we recommend the use of
a more modern package such as MPFR. However, much of Chapter 4 was
inspired by MP, and some of the algorithms implemented in MP are not yet
available in MPFR, so the source code or documentation may be
of interest: see http://rpbrent.com/pub/pub043.html.


3. MPC (http://www.multiprecision.org/) is a C library for arithmetic
using complex numbers with arbitrarily high precision and correct
rounding, written by Andreas Enge, Philippe Théveny, and Paul Zimmermann
[90]. MPC is built on and follows the same principles as MPFR.


4. MPFI is a package for arbitrary-precision floating-point interval arithmetic,
based on MPFR. It can be useful to get rigorous error bounds using interval
arithmetic. See http://mpfi.gforge.inria.fr/, and also §5.3.


5. Several other interesting/useful packages are listed under “Other Related
Free Software” at the MPFR website http://www.mpfr.org/.
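
The idea behind interval packages such as MPFI (item 4 above) is that directed rounding turns every operation into a rigorous enclosure. The following Python sketch imitates this with the standard `decimal` module, whose basic operations are correctly rounded in the chosen direction; it does not use MPFI or MPFR, and the function names `enclose_sum_of_reciprocals` and `contains` are hypothetical.

```python
from decimal import Decimal, Context, ROUND_FLOOR, ROUND_CEILING
from fractions import Fraction

def enclose_sum_of_reciprocals(denominators, digits=30):
    """Return (lo, hi) with lo <= sum(1/d) <= hi for positive integers d."""
    down = Context(prec=digits, rounding=ROUND_FLOOR)    # round toward -infinity
    up = Context(prec=digits, rounding=ROUND_CEILING)    # round toward +infinity
    lo = hi = Decimal(0)
    for d in denominators:
        # Lower bound: every operation rounded down; upper bound: rounded up.
        lo = down.add(lo, down.divide(Decimal(1), Decimal(d)))
        hi = up.add(hi, up.divide(Decimal(1), Decimal(d)))
    return lo, hi

def contains(lo, hi, exact):
    """Check an enclosure against an exact rational value."""
    return Fraction(lo) <= exact <= Fraction(hi)

# Example use: 1/3 + 1/7 = 10/21 must lie inside the computed interval.
lo, hi = enclose_sum_of_reciprocals([3, 7])
inside = contains(lo, hi, Fraction(10, 21))
```

The width of the interval, here a few units in the last place, is a guaranteed bound on the rounding error, which is the kind of information a single rounded result cannot provide.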


5.1.6 Computational algebra packages 


There are several general-purpose computational algebra packages that incor- 
porate high-precision or arbitrary-precision arithmetic. These include Magma, 
Mathematica, Maple, and Sage. Of these, Sage is free and open-source; the 
others are either commercial or semi-commercial and not open-source. The 
authors of this book have often used Magma, Maple, and Sage for prototyping 
and testing algorithms, since it is usually faster to develop an algorithm in a 
high-level language (at least if one is familiar with it) than in a low-level lan- 
guage like C, where there are many details to worry about. Of course, if speed 
of execution is a concern, it may be worthwhile to translate the high-level code
into a low-level language, but the high-level code will be useful for debugging
the low-level code.

1. Magma (http://magma.maths.usyd.edu.au/magma/) was developed
and is supported by John Cannon's group at the University of Sydney.
Its predecessor was Cayley, a package designed primarily for computational
group theory. However, Magma is a general-purpose algebra package
with logical syntax and clear semantics. It includes arbitrary-precision
arithmetic based on GMP, MPFR, and MPC. Although Magma is not open-source, [...]


2. Maple (http://www.maplesoft.com/) is a commercial package originally
developed at the University of Waterloo, now by Waterloo Maple,
Inc. It uses GMP for its integer arithmetic (though not necessarily the latest 
version of GMP, so in some cases calling GMP directly may be significantly 
faster). Unlike most of the other software mentioned in this chapter, Maple 
uses radix 10 for its floating-point arithmetic. 


3. Mathematica is a commercial package produced by Stephen Wolfram's
company Wolfram Research, Inc. In the past, public documentation on
the algorithms used internally by Mathematica was poor. However, this
situation may be improving. Mathematica now appears to use GMP for its
basic arithmetic. For information about Mathematica, see
http://www.wolfram.com/products/mathematica/.


4. NTL (http://www.shoup.net/ntl/) is a C++ library providing data
structures and algorithms for manipulating arbitrary-length integers, as well
as vectors, matrices, and polynomials over the integers and over finite fields.
For example, it is very efficient for operations on polynomials over the finite
field F_2 (i.e. GF(2)). NTL was written by and is maintained by Victor
Shoup.


5. PARI/GP (http://pari.math.u-bordeaux.fr/) is a computer algebra
system designed for fast computations in number theory, but also
able to handle matrices, polynomials, power series, algebraic numbers, etc. 
PARI is implemented as a C library, and GP is the scripting language for 
an interactive shell giving access to the PARI functions. Overall, PARI is a 
small and efficient package. It was originally developed in 1987 by Chris- 
tian Batut, Dominique Bernardi, Henri Cohen, and Michel Olivier at Uni- 
versité Bordeaux I, and is now maintained by Karim Belabas and a team of 
volunteers. 


6. Sage (http://www.sagemath.org/) is a free, open-source mathematical
software system. It combines the power of many existing open-
source packages with a common Python-based interface. According to the 
Sage website, its mission is “Creating a viable free open-source alternative 
to Magma, Maple, Mathematica and Matlab”. Sage was started by William 
Stein and is developed by a large team of volunteers. It uses MPIR, MPFR, 
MPC, MPFI, PARI, NTL, etc. Thus, it is a large system, with many capa- 
bilities, but occupying a lot of space and taking a long time to compile. 


5.2 Mailing lists


5.2.1 The GMP lists


There are four mailing lists associated with GMP: gmp-bugs for bug reports;
gmp-announce for important announcements about GMP, in particular new
releases; gmp-discuss for general discussions about GMP; gmp-devel
for technical discussions between GMP developers. We recommend subscription
to gmp-announce (very low traffic), to gmp-discuss (medium to
high traffic), and to gmp-devel only if you are interested in the internals of
GMP. Information about these lists (including archives and how to subscribe)
is available from http://gmplib.org/mailman/listinfo/.


5.2.2 The MPFR list 


There is only one mailing list for the MPFR library. See
http://www.mpfr.org/ to subscribe or search through the list archives.


5.3 On-line documents 


The NIST Digital Library of Mathematical Functions (DLMF) is an ambitious
project to completely rewrite Abramowitz and Stegun's classic Handbook of
Mathematical Functions [1]. It is available at http://dlmf.nist.gov/ and
will also be published in book form by Cambridge University Press.


The Wolfram Functions Site http://functions.wolfram.com/
contains a lot of information about mathematical functions (definition, specific
values, general characteristics, representations as series, limits, integrals,
continued fractions, differential equations, transformations, and so on).

The Encyclopedia of Special Functions (ESF) is another nice web site, whose
originality is that all formulae are automatically generated from very few data
that uniquely define the corresponding function in a general class [163]. This
encyclopedia is currently being reimplemented in the Dynamic Dictionary of
Mathematical Functions (DDMF); both are available from
http://algo.inria.fr/online.html.

A large amount of information about interval arithmetic (introduction, software,
languages, books, courses, applications) can be found on the Interval
Computations page http://www.cs.utep.edu/interval-comp/.

Mike Cowlishaw maintains an extensive bibliography of conversion to and
from decimal arithmetic at http://speleotrove.com/decimal/.

Useful if you want to identify an unknown real constant
is the Inverse Symbolic Calculator (ISC) by Simon Plouffe (building on
earlier work by the Borwein brothers) at
http://oldweb.cecm.sfu.ca/projects/ISC/.

Finally, an extremely useful site for all kinds of integer/rational sequences is
Neil Sloane's Online Encyclopaedia of Integer Sequences (OEIS) at
http://www.research.att.com/~njas/sequences/.

Summary of complexities


Integer arithmetic (n-bit or (m, n)-bit input)

Addition, subtraction                 | O(n)
Multiplication                        | M(n)
Unbalanced multiplication (m ≥ n)     | M(m,n) ≤ ⌈m/n⌉ M(n), M((m+n)/2)
Division                              | O(M(n))
Unbalanced division (with remainder)  | D(m+n,n) = O(M(m,n))
Square root                           | O(M(n))
kth root (with remainder)             | O(M(n))
GCD, extended GCD, Jacobi symbol      | O(M(n) log n)
Base conversion                       | O(M(n) log n)


Modular arithmetic (n-bit modulus)

Addition, subtraction                        | O(n)
Multiplication                               | M(n)
Division, inversion, conversion to/from RNS  | O(M(n) log n)
Exponentiation (k-bit exponent)              | O(kM(n))


Floating-point arithmetic (n-bit input and output)

Addition, subtraction                                                | O(n)
Multiplication                                                       | M(n)
Division                                                             | O(M(n))
Square root, kth root                                                | O(M(n))
Base conversion                                                      | O(M(n) log n)
Elementary functions (in a compact set excluding zeros and poles)    | O(M(n) log n)
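
As a small check on one entry of the summary, the O(kM(n)) bound for modular exponentiation with a k-bit exponent follows from binary exponentiation, which uses one squaring per exponent bit and one extra product per non-zero bit. The Python sketch below counts these operations; the function name `modexp` is hypothetical, and each modular product is assumed to cost O(M(n)).

```python
def modexp(base, exponent, modulus):
    """Left-to-right binary exponentiation (sketch).

    Returns (base**exponent mod modulus, number of modular products used).
    """
    result = 1 % modulus
    count = 0
    for bit in bin(exponent)[2:]:
        result = result * result % modulus       # one squaring per exponent bit
        count += 1
        if bit == "1":
            result = result * base % modulus     # one product per non-zero bit
            count += 1
    return result, count

# Example use: a 17-bit exponent with two non-zero bits.
value, products = modexp(3, 65537, 10 ** 9 + 7)
```

The count never exceeds 2k for a k-bit exponent, so the total cost is at most 2k modular multiplications.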


